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EXECUTIVE  SUMMARY 

Steganography  is  a  relatively  new  mode  of  communication  that  is  far  less  developed  and  understood  in  comparison 
to  other  research  fields  that  offer  privacy  and  security,  such  as  cryptography.  In  particular,  despite  its  fundamental 
importance,  the  very  basic  question  of  how  big  a  payload  it  is  secure  to  embed  in  a  given  cover  object  to  prevent  an 
adversary  from  detecting  the  presence  of  a  secret  message,  is  still  open.  The  first  result  in  this  direction  appeared 
in  [61,  88,  89].  Through  a  series  of  articles  published  in  2004  -2008,  it  has  been  established  that  steganographic  capacity 
of  perfectly  secure  stegosystems  grows  linearly  with  the  number  of  cover  elements  (pixels)  -  secure  steganography  has  a 
positive  rate.  In  practice,  however,  neither  the  adversary  (Warden)  nor  the  steganographer  has  perfect  knowledge  of  the 
cover  source  and  thus  it  is  unlikely  that  perfectly  secure  stegosystems  for  empirical  covers,  such  as  digital  media,  will 
ever  be  constructed.  This  justifies  the  first  topic  on  which  the  PI  focused  -  the  study  of  secure  capacity  of  imperfect 
stegosystems. 

In  2006  and  2007,  theoretical  results  appeared  concerning  the  paradigm  of  batch  steganography  [50,  51]  (embedding  by 
dividing  payload  among  multiple  covers  to  minimize  statistical  detectability).  These  results,  supported  by  experiments 
with  blind  steganalyzers,  pointed  to  an  emerging  paradigm:  whether  steganography  is  performed  in  a  large  batch  of  cover 
objects  or  in  a  single  large  object,  the  size  of  the  secure  payload  grows  only  sub-linearly.  In  particular,  the  secure  payload 
appeared  to  be  proportional  to  the  square  root  of  the  number  of  pixels  in  the  cover  image.  This  so-called  Square-Root  Law 
of  Steganography  is  the  first  principal  achievement  reported  here.  In  Section  1,  it  is  formally  established  for  imperfect 
stegosystems  that  hide  messages  in  covers  modeled  as  a  stationary  Markov  chain  and  when  the  embedding  changes  are 
mutually  independent.  An  important  new  part  of  this  contribution,  explained  in  Section  (2),  is  a  complete  characterization 
of  perfeetly-secure  cover  sources  with  respect  to  a  given  embedding  operation.  The  square  root  law  has  been  confirmed 
experimentally  on  real  imagery  and  several  different  embedding  and  steganalysis  methods  in  Section  3.  Among  other 
contributions,  it  has  been  established  that  secure  payload  of  imperfect  steganographic  systems  is  completely  described 
using  the  so-called  root  rate  that  depends  on  the  steganographic  Fisher  information,  which  forms  a  security  descriptor 
equivalent  to  the  Kullback-Leibler  divergence  (Section  2).  The  Fisher  information  can  be  used  for  optimizing  the  design 
of  steganographic  systems,  for  benchmarking,  and  comparison  of  covert  communications  schemes. 

Having  established  the  fundamental  limitations  of  covert  communication  in  empirical  (incognizable)  covers,  the  PI  then 
directed  her  effort  towards  a  general  framework  within  which  steganographic  schemes  can  be  built  in  practice  (Sections  5- 
6).  By  abandoning  the  concept  of  preserving  the  cover  distribution  (as  the  distribution  is  incognizable  anyway),  the 
steganography  is  instead  casted  as  a  source  coding  with  fidelity  constraint.  The  so-called  minimal  embedding  impact 
steganography  is  formulated  using  the  concept  of  distortion  that  is  fundamentally  tied  to  statistical  detectability.  The 
PI  resolved  the  problem  of  embedding  while  minimizing  an  essentially  arbitrary  distortion  function  by  formulating  the 
embedding  problem  within  statistical  physics.  The  resulting  “Gibbs  construction”  (Section  6)  is  an  elegant  framework 
within  which  one  can  study,  design,  and  optimize  steganographic  schemes.  Syndrome-trellis  codes  were  proposed  as  a 
general  approach  to  practical  embedding.  Finally,  the  PI  managed  to  tie  distortion  to  statistical  detectability  by  converting 
the  problem  of  minimizing  detection  to  a  parameter  optimization  problem  [24].  Thus,  the  developed  framework  provides 
an  enclosed  and  complete  theoretical  basis  for  further  development  of  steganography  in  empirical  covers. 

This  report  makes  use  of  the  Iverson  bracket  [/]  defined  to  be  1  if  the  logical  expression  I  is  true  and  zero  otherwise. 
The  binary  entropy  function  h(x)  =  — xlogx  —  (1  —  x)  log(l  -  x)  is  expressed  in  bits. 
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1.  The  Square  Root  Law  of  Secure  Steganographic  Payload 

In  steganography,  the  sender  communicates  with  the  receiver  by  hiding  her  messages  inside  innocuous  looking  (cover) 
objects.  Most  practical  steganographic  methods  embed  messages  by  slightly  modifying  individual  elements  of  the  cover, 
obtaining  thus  the  modified  stego  object  that  conveys  the  hidden  message.  The  goal  here  is  to  make  the  stego  objects 
statistically  indistinguishable  from  covers  -  a  passive  warden,  who  is  merely  inspecting  the  traffic,  cannot  construct  a 
detector  of  stego  objects  that  would  work  better  than  an  algorithm  that  makes  random  guesses.  The  assumption  is 
that,  up  to  a  secret  shared  key,  the  warden  is  familiar  with  all  details  of  the  steganographic  scheme:  this  is  the  so-called 
Kerckhoffs*  principle,  which  is  also  interpreted  to  mean  that  the  warden  has  complete  knowledge  of  the  probabilistic 
distribution  of  cover  objects. 

Statistical  detectability  of  embedding  changes  depends  on  their  character  and  extent.  Intuitively,  it  should  be  possible 
to  send  short  messages  with  lower  risk  of  being  detected  than  long  messages.  From  a  practical  point  of  view,  the  sender 
needs  to  know  how  long  a  message  she  can  embed  for  a  chosen  risk  -  she  needs  to  know  the  steganographic  capacity  of 
the  stegosystem.  Unfortunately,  determining  the  steganographic  capacity  analytically  for  real  digital  media  objects,  such 
as  digital  images,  is  very  difficult  even  for  the  simplest  steganographic  paradigms,  such  as  LSB  (Least  Significant  Bit) 
embedding.  The  reason  is  the  lack  of  accurate  statistical  models. 

One  may  intuitively  expect  the  steganographic  capacity  to  be  linear  in  the  size  of  the  cover  object  by  referring  to  a 
similar  result  for  capacity  of  noisy  communication  channels.  This  is,  indeed,  valid  if  the  stegosystem  is  perfectly  secure, 
since  there  is  no  possible  detector  [61,  12].  In  view  of  the  absence  of  provably  secure  steganographic  methods  for  real 
digital  media,  it  makes  sense  to  investigate  steganographic  capacity  of  imperfect  embedding  methods  for  which  detectors 
exist  and  inquire  about  the  largest  payload  that  can  be  embedded  using  their  e-secure  versions  in  the  sense  of  Cachin  [9]. 

The  fact  that  steganographic  capacity  is  most  likely  sublinear  was  already  suspected  by  Anderson  [1]  in  1996: 

“Thanks  to  the  Central  Limit  Theorem,  the  more  covertext  one  gives  the  Warden,  the  better  he  will  be 
able  to  estimate  its  statistics,  and  so  the  smaller  the  rate  at  which  [the  steganographer]  will  be  able  to 
tweak  bits  safely.  The  rate  might  even  tend  to  zero... 77 

The  analysis  of  batch  steganography  and  pooled  steganalysis  by  Ker  [51]  tells  us  that  steganographic  capacity  of  imperfect 
stegosystems  only  grows  as  the  square  root  of  the  number  of  communicated  covers.  This  result  could  be  interpreted  as  the 
square  root  capacity  law  for  a  single  image  by  dividing  it  into  smaller  blocks.  The  capacity  result,  however,  was  obtained 
with  the  assumption  that  the  individual  images  (blocks)  form  a  sequence  of  independent  random  variables,  which  is  clearly 
false  not  only  for  images  but  also  other  digital  media  files.  The  main  contribution  reported  in  this  section  is  to  establish 
the  same  law  for  the  simplest  form  of  dependence  that  enables  analytical  reasoning  -  it  will  be  assumed  that  individual 
elements  of  the  cover  (e.g.,  pixels)  follow  stationary  Markov  chain. 


1.1.  Basic  assumptions.  This  section  introduces  notation  and  three  basic  assumptions  under  which  the  SRL  (Square- 
Root  Law)  is  proved.  The  first  assumption  concerns  the  impact  of  embedding.  It  is  postulated  that  the  stego  object  is 
obtained  by  applying  a  mutually  independent  embedding  operation  to  each  cover  element.  This  type  of  embedding  can 
be  found  in  majority  of  practical  embedding  methods  (see,  e.g.,  [37]  and  the  references  therein).  The  second  assumption 
concerns  the  model  of  covers.  The  individual  cover  elements  are  required  to  form  a  first-order  Markov  chain  because 
this  model  is  analytically  tractable  while  allowing  study  of  more  realistic  cover  sources  with  memory.  Finally,  the  third 
assumption  essentially  states  that  the  steganographic  method  is  not  perfectly  secure. 

Throughout  the  report,  A  =  (a^)  denotes  a  matrix  with  elements  a ij,  calligraphic  font  (X)  to  denote  sets,  and  capital 
letters  (X,  Y)  to  denote  random  variables,  both  vector  and  scalar.  If  y  is  a  vector  with  components  y  =  (yi, . . . ,  yn),  ylk 
denotes  the  subsequence  ylk  =  (y*, . . . ,  yi ).  If  Y  =  (Yi, . . . ,  Yn)  is  a  random  vector  with  underlying  probability  distribution 
P,  then  P(Yk  =*  ylk)  denotes  the  marginal  probability  P(Yk  =  y Y*+ 1  =  y^+i, . . . ,  Yi  =  yj). 

An  n-element  cover  source  will  be  represented  using  a  random  variable  Xf  ==  (Xj,..  .,Xn)  distributed  according  to 
some  general  distribution  P(n^  over  Xn,  X  =  {1  ,...,X}.  A  specific  cover  object  is  a  realization  of  XJ*  and  will  be 
denoted  with  the  corresponding  lower  case  letter  ar?  =  (xi,...,xn)  €  Xn.  A  stegosystem,  with  covers  of  fixed  size 
n,  is  a  triple  Sn  =  (X[\$(n\  $dn))  consisting  of  the  random  variable  describing  the  cover  source,  embedding  mapping 
and  extraction  mapping  The  embedding  mapping  applied  to  XJ1  induces  another  random  variable 

y,n  =  (Ti,...,  Yn)  with  probability  distribution  Q ^  over  Xn .  Specific  realizations  of  Yxn  are  called  stego  objects  and  will 
be  denoted  y£  =  (yi,  • .  - , yn)*  Here,  fi  >  0  is  a  scalar  parameter  of  embedding  whose  meaning  will  be  explained  shortly. 
The  specific  details  of  the  embedding  (and  extraction)  mappings  are  immaterial  for  this  study.  In  particular,  one  only 
needs  to  postulate  the  probabilistic  impact  of  embedding. 
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FIGURE  1.1.  Examples  of  several  embedding  methods  in  the  form  of  a  functional  matrix  B. 


Assumption  1:  [Mutually  independent  embedding]  The  embedding  algorithm  visits  every  cover  element  Xk  and, 
independently  of  all  other  elements,  modifies  it  to  a  corresponding  element  of  the  stego  object  Yk  with  probability 


(1.1) 


Qp(Yk  =  j\Xk=i)±bij(l3) 


1  +  ySci.i  if  i  =  j 
ficij  otherwise, 


for  some  constants  ( Cij )  with  c^j  >  0  for  i  ^  j.  Note  that  because  Yljex  bi,j  =  1,  one  must  have  Ci,t  =  —  f°r 

each  i  £  X.  The  matrix  (c*,j)  reflects  the  inner  workings  of  the  embedding  algorithm,  while  the  parameter  $  captures 
the  extent  of  embedding  changes.  It  will  be  useful  to  think  of  /?  as  the  relative  number  of  changes  (change  rate)  or  some 
function  of  the  change  rate.  Also  note  that  one  can  find  sufficiently  small  fo,  such  that  b^i(P)  >  0  for  /?  E  [0,  A)]  and  all 
i€*. 

Because  the  matrix  Bp  =  (bij(fi))  does  not  depend  on  k  G  {1, . . . ,  n}  or  the  history  of  embedding  changes,  one  can  say 
that  the  stego  object  is  obtained  from  the  cover  by  applying  to  each  cover  element  a  Mutually  Independent  embedding 
operation  (one  speaks  of  MI  embedding).  The  independence  of  embedding  modifications  implies  that  the  conditional 
probability  of  stego  object  given  the  cover  object  can  be  factorized,  i.e.,  *G/in|A’f )  =  n”=i  Qp(Yi\Xi)- 

Many  embedding  algorithms  across  different  domains  use  MI  embedding.  Representative  examples  are  LSB  embedding, 
±1  embedding,  stochastic  modulation,  Jsteg,  MMx,  and  various  versions  of  the  F5  algorithm  [29].  Examples  of  the  matrix 
B^  for  three  selected  embedding  methods  are  showm  in  Figure  1.1. 

Next,  an  assumption  on  the  cover  source  is  formulated. 

Assumption  2:  [Markov  cover  source]  It  is  assumed  that  the  cover  source  X[*  is  a  first-order  stationary  Markov 
Chain  over  X,  which  will  often  be  abbreviated  as  simply  Markov  Chain  (MC).  This  source  is  completely  described 
by  its  stochastic  transition  probability  matrix  A  =  (a*y)  €  RNxN>  a*j  =  Pr(Xk  =  j\Xk-i  =  i),  and  by  the  initial 
distribution  Pr(Xi).  The  probability'  distribution  induced  by  the  MC  source  generating  n-element  cover  objects  satisfies 
p(n)(Xi  =  Xi)  =  p(n~1)(XJl“*1  =  x^“1)aXn-lXn,  where  P^^(Xi)  is  the  initial  distribution.  Furthermore,  it  is  assumed 
that  the  transition  probability  matrix  of  the  cover  source  satisfies  a ij  >  6  >  0,  for  some  S  and  thus  the  MC  is  irreducible. 
The  stationary  distribution  of  the  MC  source  is  a  vector  n  =  (7Ti, . . .  ,7nv)  satisfying  ttA  =  tt.  In  this  report,  it  will 
always  be  assumed  that  the  initial  distribution  P^^(Xi)  =  7 r,  which  implies  P^(Xk)  =  7 r  for  every  n  and  k.  This 
assumption  simplifies  the  analysis  without  loss  of  generality  because  the  marginal  probabilities  P(n)(X*)  converge  to  7r 
with  exponential  rate  w.r.t.  k  (see  Doob  [17],  equation  (2.2)  on  page  173).  In  other  words,  MCs  are  “forgetting”  their 
initial  distribution  with  exponential  rate. 

Under  the  above  assumption  and  the  class  of  MI  embedding,  the  source  of  stego  objects  no  longer  exhibits  the  Markov 
property  and  forms  a  Hidden  Markov  Chain  (HMC)  instead  [79].  The  HMC  model  is  described  by  its  hidden  states  (cover 
elements)  and  output  transition  probabilities  (MI  embedding).  Hidden  states  are  described  by  the  cover  MC  and  the 
output  probability  transition  matrix  B  is  taken  from  the  definition  of  MI  embedding. 

Unless  stated  otherwise,  in  the  rest  of  this  report  Qp1*  denotes  the  probability  measure  induced  by  the  HMC  source 
embedded  with  parameter  /?  into  n-element  MC  cover  objects.  By  the  stationarity  of  the  MC  source,  the  marginal 
probabilities  pW(X£+l)  =  P&\X f)  and  Q$\y£+1)  =  Q^(Y j2)  for  all  k.  Sometimes  the  number  of  elements,  n,  will 
be  omitted  and  P  and  Qp  will  denote  the  probability  distribution  over  cover  and  stego  objects,  respectively. 

The  third  assumption  concerns  the  entire  stegosystem  Sn.  Because  it  is  known  [61,  12]  that  steganographic  capacity  of 
perfectly  secure  stegosystems  is  linear  in  n,  the  SRL  can  only  apply  to  imperfect  stegosystem s. 
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Assumption  3:  [FI  condition]  It  is  assumed  that  the  stegosystem  Sn  =  n),$(n))  is  not  perfectly  secure  in 

the  sense  of  Cachin  [9]  (the  KL  divergence  Dkl(P^WQ^)  >  0).  It  is  shown  in  Section  (2),  that  for  the  special  case  of 
Markov  cover  sources  X”  and  MI  embedding  this  assumption  can  be  equivalently  stated  in  two  different  forms: 

(1)  The  pair  (P^.Q^)  does  not  satisfy  so  called  Fisher  Information  condition , 

(1.2)  Vj/J  e  x"1  (p«>(x?  =  »?)  >  o)  =»  U_„  =  °) • 


(2)  There  exists  a  pair  of  states  (i,  j)  such  that 


(1.3) 


P(X l  =  (i,  j))  ^  Q0(V ?  =  (i,  j))  for  all  /9>0. 


For  the  sake  of  continuity,  the  PI  only  provides  a  few  brief  arguments.  First  of  all,  perfectly  secure  stegosystem s  must 
satisfy  (1.2)  because  the  Fisher  information 


1(0)  =  EP 


appears  as  a  coefficient  in  front  of  /32  in  the  Taylor  expansion  of  KL  divergence  D/cl(^(2)||Q^2)  )  w.r.t.  ft  and  thus 
l^-o  must  be  zero  whenever  P^2\y J)  >  0.  The  opposite  implication  (zero  Fisher  information  implies  zero  KL 
divergence)  is  not  valid  in  general  but  holds  for  MI  embedding  as  shown  in  Section  (2).  The  second  condition  follows  from 
the  fact  that  second-order  marginal  statistics  fully  describe  the  first-order  MC  process  and  thus  if  (1.3)  does  not  hold, 
then  both  cover  and  stego  distributions  are  the  same  for  all  n  (the  stegosystem  is  perfectly  secure). 

Finally,  it  should  be  stressed  that  Assumptions  1-3  are  not  overly  restrictive  and  will  likely  be  satisfied  for  all  practical 
steganographic  schemes  in  some  appropriate  representation  of  the  cover.  For  example,  in  digital  images  it  is  unlikely  that 
the  distribution  of  each  pixel  depends  only  on  its  neighbor,  but  the  dependency  is  likely  to  be  spatially-limited.  Then  the 
image  can  be  modeled  as  a  Markov  chain  made  up  of  overlapping  pixel  groups.  Furthermore,  if  a  stegosystem  preserves 
the  first-order  statistics  of  a  cover  source,  it  is  likely  to  be  detectable  by  considering  higher-order  dependencies:  the 
apparently-perfect  stegosystem  becomes  imperfect  when  the  cover  is  represented  by  pairs  or  groups  of  pixels,  coefficients, 
or  some  other  derived  quantities. 


1.2.  The  square  root  law  of  steganographic  capacity.  This  section  contains  the  formulation  and  the  proof  of  the 
main  result,  which  states  that  the  steganographic  capacity  of  imperfect  stegosystems  with  Markov  covers  and  mutually 
independent  embedding  operation  only  grows  with  the  square  root  of  the  number  of  cover  elements.  This  finding  has  some 
fundamental  implications  in  steganography  and  steganalysis.  Probably  the  most  remarkable  one  is  that  steganographic 
capacity  exhibits  quite  different  properties  when  compared  with  capacity  of  noisy  channels  or  lossless  compression.  For 
example,  while  a  mismatch  in  source  model  decreases  the  compression  gain  by  a  constant  (the  KL  divergence  between  the 
source  model  and  true  source  distribution),  a  cover  model  mismatch  in  steganography  leads  to  vanishing  capacity. 

The  steganographer  is  at  risk  (w.r.t.  some  fixed  tuple  (PpA>  Pmd)'  with  0  <  P’fa  <  1  and  0  <  P*md  <  1  “  P'fa) 
the  warden  has  a  detector  with  probability  of  false  alarms  and  missed  detection  Pfa^Pmd  satisfying  Pfa  <  PpA  and 
Pmd  <  Pmd- 

Theorem  Is  [The  square  root  law  of  steganography  for  Markov  covers]  For  a  sequence  of  stegosystems  (Sn)^! 
satisfying  Assumptions  1-S,  the  following  holds: 

(1)  If  the  sequence  of  embedding  parameters  fi(n)  increases  faster  than  l/y/n  in  the  sense  that  lim^oo  =  oo, 
thenT  for  sufficiently  large  n,  the  Steganographer  is  at  risk  for  arbitrary  tuple  (PpA,  Pmd)- 

(2)  If  &(n)  increases  slower  than  1/ y/n,  lim^oo  f-jfy  =  0,  then  the  stegosystem  can  be  made  e-secure  for  any  e  >  0 
for  sufficiently  large  n.  This  implies  that  the  Steganographer  is  not  at  risk ,  for  any  tuple  (PpA ,  Pmd)' 

(3)  Finally ,  if  ffin)  grows  asymptotically  as  fast  as  l/y/n ,  limn^oo  =  e  for  some  0  <  e  <  oo:  then  the  stegosystem 
is  asymptotically  Ct2 -secure  for  some  constant  C . 

Proof:  Each  part  of  the  theorem  is  proved  separately.  From  the  Kerckhoffs’  principle,  the  warden  knows  the  distribution 
of  cover  objects  P^  = 

Part  1  [Steganographer  at  risk]  Here,  one  needs  to  prove  that  the  Steganographer  is  at  risk  w.r.t.  any  (Pp^Plij/)  for 
all  sufficiently  large  n.  This  means  that  a  sequence  of  detectors,  Dn,  needs  to  be  constructed  for  the  following  composite 
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binary  hypothesis  testing  problem 


H0  :  £  =  0 

Hi  :  /?  >  0 

based  on  observing  one  stego  object  (one  realization  of  a  random  sequence  with  distribution  Q  ?n)).  The  error  probabilities 
of  these  detectors  are  required  to  satisfy  Pfa  <  PpA  and  Pmd  <  Pmd  for  all  sufficiently  large  n.  The  test  statistic  for 
each  detector  Dn  is  described  next. 

Equation  (1.3)  in  Assumption  3  guarantees  the  existence  of  pair  of  states  (i,j)  such  that  P(X 2  =  (i,j))  7^  Qp{Y\  = 
(i,  j))  for  all  0  >  0.  Thus,  the  test  statistic  v$>rx  for  detector  Dn  is  defined  as 


(1.4) 


=  Vn 


-  P(x?  = 


where  ^zihp[i,j]  is  the  relative  count  of  the  number  of  consecutive  pairs  (i,jf)  in  an  n-element  stego  object  embedded 
using  parameter  /?  (In  terms  of  indicator  functions1,  j]  =  Y^kZi  I{Yfc=i,Y*+ !=;})•  Note  that  due  to  stationarity  of  the 
cover  source,  E  j]J  =  Qp{Y?  =  {ij))  for  all 

The  following  is  proved  for  the  difference  between  the  means  of  v$%n  under  both  hypotheses: 


(1.5) 


lim  E[i/p'n\  —  £[pq ,n]  =  00  when  >/nfi  — >  00. 


Suppose,  for  a  contradiction,  that  there  exists  K  >  0,  and  a  strictly  increasing  sequence  of  integers  for  which 


(1.6)  \E[^nm)  -  £KnJ|  <  K  for  all  m. 

If  lirnsupm^00/?(nm)  =  0o  >  0,  then  there  exists  a  subsequence  of  (um)m=i»  which  will  be  denoted  the  same  to  keep 
the  notation  simple,  such  that  limm_*oo  /3(nm)  =  /?0.  For  this  subsequence,  however,  the  difference 


£K«m]  -  £h,n„l  =  v^rj Qp{Y?  =  (i,j))  -  P(X?  =  (ij)) | 

tends  to  00  with  m-4oc  because  by  (1.3)  the  absolute  value  converges  to  a  positive  value  independent  of  m.  This  is, 
however,  a  contradiction  with  (1.6). 

If  limm_*oo  0(nm)  =  0,  one  find  the  contradiction  in  a  different  manner.  By  the  FI  condition  from  Assumption  3,  there 
must  exist  states  (i,j)  such  that  =  ihj))  7^  0.  From  the  Taylor  expansion2  of  Qp{Y\  —  (i, 3))  at  0  =  0  with 

Lagrange  remainder  and  0  <  £  <  1 


(1.7) 


P ,nm ]  ““  Sfa.nJ  —  \/nrn0 


jpQeMY?  =  ft*))  +  =  (m)) 


which  tends  to  00  as  m  — >  00  when  y/rimfi  — y  00,  which  is  again  a  contradiction  with  (1.6).  In  summary.  E[vpyn\  —  E[i/o,n]  — > 
00  holds  for  any  sequence  fi(n)  for  whicli  y/n0(n)  00. 

Lemma  1  proved  below  shows  that  exponential  forgetting  of  Markov  chains  guarantees  that 


(1.8)  Var[i/0in]  <  C 

for  some  constant  C  independent  of  n  and  0.  Equations  (1.5)  and  (1.8)  are  all  that  is  needed  to  construct  detectors  Dn 
that  will  put  the  Steganographer  at  risk  for  all  sufficiently  large  n.  The  detector  Dn  has  the  following  form 

vpfn  >  T  decide  stego  [0  >  0) 

^,n  <  T  decide  cover  (/?  =  0), 

where  T  is  a  fixed  threshold.  It  is  now  shown  that  T  can  be  chosen  to  make  the  detector  probability  of  false  alarms  and 
missed  detections  satisfy 


Pfa  <  PpA 
Pmd  <  Pmd 

for  sufficiently  large  n.  The  threshold  T(PpA)  will  be  determined  from  the  requirement  that  the  probability  of  the  right 
tail,  x  >  T(PpA ),  under  Ho  is  at  most  Pfa-  Using  Chebyshev’s  inequality, 

Pfa  =  Pr(i/0,„  >T)<  Pr(|i/0.„|  >  T)  <  Vaffi’ni  < 

^or  any  two  statements  A,  #»  !{>*,£}  =  1  if  A  and  B  are  true,  otherwise  =  0. 

^he  Taylor  expansion  is  valid  since  by  its  form  the  function  =  (i,  j))  is  analytic. 
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Setting  T  =  y/C/PpA  gives  Pfa  <  Pfa * 

Because  of  the  growing  difference  between  the  means  (1.5),  one  can  find  n  large  enough  so  that  the  probability  of  the 
left  tail,  x  <  T(PpA ),  under  Hi  is  less  than  or  equal  to  PhD-  Again,  one  can  use  the  Chebyshev’s  inequality  with  the 
bound  on  the  variance  of  i/p^n  to  prove  this: 

PmD  =  Pr{ypyn  <  P{Pfa))  =  Pr{u0,n  ““  E\vp,n  ^n]  <  P(Pfa)  ^(^,n  ~~  ^O.n]) 

<  Pr( \vp,n  —  “  ^O.nJI  >  E[ Vpyn  ~  ^O.n]  ““  ^TPfa))  ^ 


\2  > 


which  can  be  made  arbitrarily  small  for  sufficiently  large  n  because  E[i/ptTl]  —  E[v$%n ]  -4  oo.  Tliis  establishes  the  first  part 
of  the  square  root  law. 

Part  2  [Asymptotic  undetectability]  Now  it  is  proved  that  when  y/nfi  -4  0,  then  the  KL  divergence  between  the 
distributions  of  cover  and  stego  objects  tends  to  zero, 


(1.9) 


-  Bm(/x»II0?>)  =  £  Pl">OT-»nie^y^-|,‘)-»o, 

-  —  vm  —  y i) 


which  will  establish  that  the  steganography  is  e-secure  for  any  e  >  0  for  sufficiently  large  n.  By  the  well-known  connection 
between  hypothesis  testing  and  KL  divergence  [9],  no  nontrivial  upper  bound  on  false  alarms  and  missed  detections  will 
be  met,  for  large  enough  n. 

Using  Taylor  expansion  of  dn(0)  with  Lagrange  remainder  at  0  =  0,  dn{0)  =  dn (Q)-f  ^(0)/?+  ft2,  where  0  <  v  <  1. 

This  step  is  valid  since  under  the  above  assumptions  all  derivatives  of  (normalized)  KL  divergence  are  continuous  w.r.t.  ft 
(the  complete  proof  appears  in  this  technical  report  [18]).  The  term  dn( 0)  is  zero  because  both  distributions  are  the  same 
when  ft  =  0.  The  term  d^(0)  is  also  zero  because 

=  &*<»  -  is  w  -  oW  -  *> 


d'JO) 


=  lim 


-1  d 


log  2  d0 


(E^n)(r”  =  y”))  =  o. 


vr 


Finally,  by  Lemma  2  in  Appendix  there  exists  a  constant  C,  such  that  £d"(ft)  <  C  for  ft  £  [0,  >So]  and  all  n.  Thus, 
dn(fi)  <  \Cn02  -4  0  when  y/n0  -4  0. 

Part  3  [Asymptotic  Ce2-security]  To  prove  the  third  part  of  the  square  root  law,  one  again  expands  the  KL  divergence 
dn(0)  at  0  =  0  up  to  the  third  order  with  the  Lagrange  form  of  the  remainder 


(1.10) 


for  some  0  <  v  <  1.  According  to  [18],  both  normalized  derivatives  of  the  KL  divergence,  £d"(0)  anc*  ^d^'(u^),  are 
upper  bounded  by  the  same  finite  constant  C  for  all  0  €  [0,  A)]-  Since  0(n)y/n  -4  e  with  n  -4  oo,  0(n)  -4  0  and  thus 
the  expansion  is  valid.  By  the  same  reason,  the  second  term  in  (1.10)  converges  to  zero  as  n  — >  oo.  From  this  result,  one 
obtains  the  asymptotic  bound  on  KL  divergence  in  the  form  dLn{0)  <  \Ce2  as  was  to  be  shown. 


1.3.  Discussion.  A  general  theme  is  now  emerging  in  steganography  literature:  whether  steganography  is  performed  in 
a  large  batch  of  cover  objects  or  a  single  large  object,  there  is  a  wide  range  of  situations  in  which  secure  capacity  grows 
according  to  the  square  root  of  the  cover  size.  Such  results  will  likely  hold  for  all  stegosystems  that  are  not  perfectly 
secure  in  the  sense  of  Cachin.  It  appears  that  the  theory  of  hidden  information  is  quite  unlike  the  traditional  theory  of 
information. 

The  result  presented  here  is  the  first  to  allow  dependence  between  the  components  of  the  cover.  The  square  root  law  of 
steganographic  capacity  was  proved  for  single  covers  under  essentially  three  conditions:  that  they  can  be  represented  as  a 
Markov  chain,  that  the  embedding  operation  can  be  modeled  as  independent  substitutions  of  one  state  for  another,  and 
that  the  embedding  scheme  does  not  preserve  all  statistical  properties  of  the  cover.  This  applies  to  a  very  wide  range  of 
popular  steganographic  algorithms,  in  spatial  and  transform  domains.  The  last  condition  is  important  because  it  is  known 
that  perfectly  secure  steganography,  conveying  information  at  a  linear  rate,  can  always  be  constructed  if  the  cover  source 
is  perfectly  understood  [89],  However,  it  can  be  argued  that  digital  media  cover  sources  will  never  be  perfectly  understood. 
To  explain  why  it  makes  sense  to  assume  that  the  warden  has  perfect  knowledge  of  the  covers  (necessary  for  construction 
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of  the  detectors  in  Part  1  of  the  proof),  one  needs  to  look  at  the  cautious  nature  of  Kerckhoffs*  principle:  although  the 
steganographer  may  believe  that  the  warden  does  not  know  the  complete  cover  distribution,  there  is  always  the  risk  that 
the  warden  knows  more  than  the  steganographer  does.  For  example,  she  might  know  more  about  dependencies  between 
cover  elements.  Since  the  steganographer  cannot  say  for  certain  how  much  the  warden  will  know,  they  must  assume  the 
worse  case:  complete  knowledge. 

The  formulation  of  the  theorem  parallels  that  of  [51]:  embedding  at  a  rate  faster  than  y/n  leads  to  eventual  detection, 
whereas  embedding  at  a  rate  slower  than  y/n  leads  to  eventual  £>security.  At  rates  Ay/ri,  the  stegosystem  is  e-secure.  This 
does  not  refer  to  embedding  at  a  diminishing  rate  in  a  single  cover  object  (which  would  be  a  different  theorem,  and  is  an 
avenue  for  future  research).  The  quantity  fi(n)  describes  a  constant  embedding  rate  throughout  an  object  of  size  n:  one 
could  think  of  the  function  /?  as  giving  a  strategy  describing  how  much  data  can  be  hidden  in  an  object  of  each  size.  The 
SRL  says  that  over-ambitious  strategies  lead  to  easier  detection  in  larger  objects,  cautious  strategies  lead  to  more  difficult 
detection  in  larger  objects,  and  quantifies  the  boundary. 

The  square  root  law  has  some  important  implications  in  steganography  and  steganalysis.  Most  significantly,  the  simple 
fact  that  capacity  is  sublinear  means  that  a  true  steganographic  channel,  with  positive  rate,  cannot  be  constructed  unless 
the  cover  source  is  known  perfectly.  For  steganalysis,  the  SRL  explains  in  part  why  the  same  relative  payload  can  be 
detected  more  accurately  in  large  images  (however  it  is  not  the  whole  explanation:  larger  images  tend  to  have  more  smooth 
gradients  and  less  noise,  and  these  factors  also  influence  detection  accuracy).  Thus,  when  benchmarking  steganography, 
the  distribution  of  image  sizes  in  the  database  influences  the  reliability  of  steganalysis  and  makes  it  more  difficult  to 
compare  the  results  on  two  different  databases.  To  resolve  this  issue,  one  might  switch  to  measuring  the  payload  in  bits 
per  square  root  of  pixel,  an  idea  further  explained  in  Section  2.3. 

Finally,  it  should  be  emphasized  that  the  square  root  law  of  capacity  relates  to  the  number  of  changes  caused  by  the 
embedding  process,  and  not  to  the  size  of  the  information  transmitted.  With  adaptive  source  coding  methods  (even  simple 
matrix  embedding  based  on  Hamming  codes  will  do)  the  number  of  bits  of  information  which  can  be  conveyed,  by  making 
at  most  c  changes  in  n  cover  locations,  is  0(clog(n/c)).  Therefore  the  SRL  implies  an  asymptotic  information  capacity 
which  is  0(y^log7i)  (i.e.,  still  sublinear),  in  the  absence  of  perfect  steganography. 


1.4.  Proofs.  In  this  appendix,  the  PI  includes  two  auxiliary  lemmas  needed  in  the  proof  of  the  SRL  in  Section  1.2. 

Lemma  1:  Let  i/^.n  be  the  random  variable  defined  in  (1.4)  for  a  fixed  value  of  the  parameter  /?  and  cover  size  n.  The 
variance  of  this  random  variable  can  be  bounded  by  a  constant  C  for  every  value  of  /?  and  n 


3C,V£,V7i  Var[isp,n}<C. 


Proof:  From  the  definition  of 
(n-l)2 


n-l 


n— 1 


n  —  1 


(l.ii) 


-  Var^.n]  -  '/art!{y|f+,=(M)}l + 

+2[  H  £[I{yt‘+1=(M)}I{y*+*^(<,j)}l“£'lI{yJ,',+,-(<,j)}l£'(I{yt+>-(<,j)}]]  +2n • 

fc+i<£ 


In  the  last  sum,  the  PI  bounded  all  terms  for  k  =  k  —  1  by  1  and  thus  obtained  the  term  2n  in  the  last  inequality.  For 
any  event  A, 

Var[lA)  =  Pr(A)  -  Pr(A)2  =  ±  -  (Pr(A)  -  ± 

so  the  first  term  is  bounded  by 

Finally,  one  finds  an  upper  bound  on  the  sum  in  (1.11)  in  the  form  of  C^n  for  some  positive  constant  C2.  This  will 
give  us  the  proof  because  Var[ i/p>n\  <  ^"^((n  —  1)|  +  2C2n  4-  2n)  <  4(£  *4*  2C2  +  2),  and  <  4  for  n  >  2.  Thus 

C  =  8C2  4"  9. 

The  PI  starts  by  showing  that 


<Mnfc+1  =  (ij),y£+1  =  (ij))  -  Qe(ykk+1  =  (ij))Qe(y£+l  =  m) 

=  [wr1  =  (u)|n*+1  =  {ij))-Qff(y£+l  =  (u))]<Mnfe+1  =  (i,i))  <  N2pk~k~2, 

S  ■  “■  V  ^ 

<N2pk-k-2 


(1.12) 
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for  some  0  <  p  <  1  and  A:  4-  1  <  k  (N  is  the  number  of  all  possible  states  of  the  MC).  In  other  words,  the  HMC  is 
exponentially  forgetting  its  initial  condition.  Then,  one  will  be  able  to  bound  the  sum  in  (1.11)  by  N 2  J2] l=z  ]Cfc= i  p*~k~2  = 
£2.,  ^  <  W’S.,  rfj  -  *'(„  -  2)i^  <  0-  Thus,  C,  =  £  because  Qs(YkM  =  (ij))  <  1. 

The  term  Q^(V^+1  =  (i,  j))  in  (1.12)  can  be  written  as 

(1.13)  Q*(y*+1  =  (i,i))  =  E^(yi+1  =  (M)l*t+1  =  (U))P(*f+1  =  (U))  =  EMi/(^+1  =  (*.3))- 

(*.5)  (>.i) 

The  term  Qg(Y^+1  =  (i, j)|**fc+1  =  (m))  in  (1.12)  can  be  written  as 

<?»(n  -  <*.j)in  -<«» - = iU)) — 

=  (U))  , 

<3^(nfc+1  =  (u)) 

Finally,  P(xj?+1  =  (i,j),Xj!+1  =  (i,j))  can  be  factorized  as  P(Xj?+1  =  (i,j)ix£+1  =  (i,j))P(x£+1  =  ( i,j )).  Due  to 
the  Markov  property  of  the  random  variable  A']*,  P(Xj'+1  =  (i,j')|AfJJ:+1  =  (i,j))  =  P(X^+1  =  (t,j)jXk+i  =  j) ■  For  each 
pair  of  indices  the  PI  defines  the  index  jma x  =  argmaxj  P(x£+1  =  (i,  j)|AVfi  =  j).  Then 

E(IJ)  =  Jmaz)  E(U)  bi.<b3jP(Xt+l  =  (ij)) 

Qe{Ykk+1  =  «.i)) 


(#)  < 


(1.14) 


'  =3)  E  bUb5,jp(xt+1  =  &3)|x*+i  =  i-ax). 


Now,  one  can  combine  (1.13)  and  (1.14)  to  prove  (1.12)  as 


<?*(** +1  =  (u)lnf+1  =  (<,  j))  -  q?(y£+1  =  (<,i» 

(1'13<(1  U  Ebt.*bj(P(Xt+1  =  (U)|**+l  =  Jma. )  -  P(*£+1  =  (?,J))) 

(1.5) 

(1.15)  =  X]  bi,ib‘j.jP(Xk+l  =  5'|**  =  5)  (^(*i  =  <|Xt+l  =  imax)  -  P(X*  =  i))  <  .V2/ -*~2. 

A3) 


It  is  a  well  known  result  in  MCs  that  the  absolute  value  of  the  term  P(X £  =  t|X*+i  =  ]max )  —  P(X^  =  i)  in  (1.15)  can 

be  bounded  by  pk~k~2  (exponential  forgetting),  for  some  constant  0  <  p  <  1.  This  is  because  the  MC  is  irreducible  due 
to  the  assumption  a^j  >  S  (see  equation  (2.2)  on  page  173  in  Doob  [17]).  This  bound  does  not  depend  on  jmox.  The  final 
bound  does  not  depend  on  ft  because  {  <  1  and  6 j  .  <  1. 


Lemma  2:  Let  dn(fi)  =  ^/^(P^IIQ^)  be  the  KL  divergence  between  n-element  cover  and  stego  sources  as  defined 
in  (1.9).  Then, 

30o, 3 C  >  0,  V0  e  (0, 0b], Vn  ~d''(0)  <  C. 

n 

In  other  words,  the  second  derivative  of  the  normalized  KL  divergence,  £d"(0),  can  be  bounded  by  a  constant  C  for  each 
n  and  0.  And  this  bound  does  not  depend  on  n  or  0. 

Proof:  The  problem  of  bounding  normalized  derivatives  of  KL  divergence  for  the  case  of  HMC  was  studied  by  Mevel  et 
al.  [60].  Their  results,  namely  Theorem  4.4  and  Theorem  4.7,  however,  cannot  be  directly  applied  to  our  case  because  our 
assumptions  are  different.  In  particular,  Assumption  C  on  page  1124  is  not  satisfied  because  zeros  are  allowed  in  matrix 
B.  Motivated  by  this  work,  the  PI  derives  a  more  general  result  about  the  normalized  KL  divergence  and  its  derivatives 
(see  the  report  in  [18].  Intuitively,  one  can  expect  the  normalized  KL  divergence  to  be  arbitrarily  smooth  and  bounded 
due  to  the  smooth  transition  from  P  to  Qg  and  the  fact  that  dn(0)  =  0.  The  main  result  of  the  report,  formally  stated 
in  [18],  Theorem  3,  says  that  every  derivative  of  £ dn(0 )  w.r.t.  0  (and  the  function  ^dn(0)  itself)  is  uniformly  bounded 
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and  Lipschitz-continuous  (or  simply  continuous)  on  [0,/?o).  These  properties  are  independent  of  n  >  1.  From  this  fact, 
Lemma  2  can  be  obtained  as  a  special  case.  This  result  also  allows  us  to  expand  the  KL  divergence  into  a  Taylor  series 
with  respect  to  /?. 
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2.  Complete  characterization  of  perfectly  secure  stego-systems  with  mutually  independent 

EMBEDDING  OPERATION 

In  steganography,  the  sender  and  receiver  communicate  by  hiding  their  messages  in  generally  trusted  media,  such  as 
digital  images,  so  that  one  cannot  distinguish  between  the  original  (cover)  objects  and  the  objects  carrying  the  message 
(stego  objects).  Formally,  the  security  of  a  stego-system  is  evaluated  using  the  Kullback-Leibler  divergence  between  the 
distributions  of  cover  and  stego  objects  [9].  Systems  with  zero  KL  divergence  are  called  perfectly  secure. 

Formally,  a  stego-system  is  a  combination  of  an  embedding  algorithm  and  a  cover  source.  The  vast  majority  of  practical 
stego-systems  hide  messages  by  modifying  individual  cover  elements  using  mutually  independent  embedding  operations, 
e.g.,  LSB  and  ±1  embedding,  F5  algorithm,  perturbed  quantization,  MMx,  stochastic  modulation,  and  many  others 
(see  [37]  and  the  references  therein). 

This  section  of  the  report  provides  a  complete  characterization  of  perfectly  secure  stego-systems  for  the  class  of  embed¬ 
ding  algorithms  that  employ  mutually  independent  (MI)  embedding  operations,  which  is  the  class  of  operations  for  which 
the  SRL  was  proved  in  the  previous  section.  The  cover  distributions  of  all  perfectly  secure  systems  form  a  linear  vector 
space  spanned  by  distributions  determined  by  the  embedding  operation.  Moreover,  perfect  security  (zero  KL  divergence) 
is  equivalent  to  satisfying  a  simple  condition  related  to  Fisher  information.  This  result  suggests  that  Fisher  information 
can  be  used  as  an  equivalent  descriptor  of  steganographic  security. 

Definition  1.  Steganography  is  perfectly  secure  iff 

d(P)±DKL(P\\Qfi)=  £  W)l°g^p)=°> 


or  e-secure  if  d(/3)  <  e. 

For  better  flow,  the  PI  reminds  that  the  impact  of  embedding  with  parameter  (}  E  [0,  /?o]  on  the  A>th  element  can  be 
captured  using  the  matrix  6ij(/?)  =  Pr(Yk  =  j\Xk  =  i)  =  Si j  +  Pctj ,  for  some  constants  Cij  >  0  for  i  ^  j,  c*,*  =  -  Ylj  * 
where  Sij  is  the  Kronecker  delta.  In  a  matrix  form,  =  I-f  $C,  where  Bp  =  (&i,j (/?))>  I  is  the  identity  matrix,  and 
C  =  (c^).  It  is  also  assumed  that  embedding  operations  are  mutually  independent,  Pr(yin|X{1)  =  112=1  Pr(Yk\Xk).  By 
the  definition  of  6»j,  the  matrix  Bp  is  stochastic,  btj  =  1.  Finally,  it  is  assumed  that  bi,i(0)  >  0  for  all  0  E  [0,/?o]- 
The  matrix  Bp  represents  an  embedding  algorithm  with  MI  embedding  operation  (simply  MI  embedding).  Examples  of 
practical  embedding  methods  that  fall  under  this  framework  are  shown  in  Figure  2.1. 

To  simplify  the  language  in  this  report,  one  will  speak  of  security  of  a  cover  source  w.r.t.  a  given  MI  embedding  meaning 
that  the  cover  source  is  perfectly  secure  w.r.t.  B,  if  the  resulting  stego-system  is  perfectly  secure.  It  does  then  make  sense 
to  inquire  about  all  possible  perfectly  secure  cover  sources  w.r.t.  MI  embedding  with  matrix  Bp. 

2.1.  A  few  known  results  from  ergodic  theory.  Some  results  from  the  theory  of  ergodic  classes  are  now  reviewed  [17]. 
They  will  be  later  applied  to  the  stochastic  matrix  Bp.  For  states  i,j  E  X.  j  is  called  a  consequent  of  i  (of  order  k)  (i  ->  j) 
iff  3 fc,  (B£)ij  ^  0.  State  i  E  X  is  transient  if  it  has  a  consequent  of  which  it  is  not  itself  a  consequent,  i.e.,  3 j  E  X  such 
that  (t  — >  j)  =>  (j  t4  t).  Furthermore,  t  E  X  is  non-transient  if  it  is  a  consequent  of  every  one  of  its  consequents,  V7  E  X , 
(t  ->  j )  =>  ( j  t).  The  set  X  can  be  decomposed  as  X  =  JF  U  £j  U  •  •  •  U  £*,  where  T  is  the  set  of  all  transient  states 
and  £a,  a  E  {1, . . . ,  fc},  are  so  called  ergodic  classes.  Two  non- transient  states  are  put  into  one  ergodic  class  if  they  are 
consequents  of  each  other. 

Let  matrix  Bp  have  k  ergodic  classes.  Then,  there  exist  k  linearly  independent  left  eigenvectors,  denoted  as  . . . , 
of  matrix  Bp  corresponding  to  eigenvalue  1 ,  called  invariant  distributions.  If  7r(a)Bp  =  7r^a^ ,  for  some  aE  { 1 , . . . ,  Jfc} ,  then 
>  0  for  all  i  E  £aj  and  =  0  otherwise.  Every  other  7r  satisfying  7rBp  =  n  is  obtained  by  a  convex  linear 
combination  of  {7r(a)|a  E  For  a  complete  reference,  see  [17,  Chapter  V,  §2].  The  set  of  ergodic  classes  for 

matrix  Bp  depends  only  on  the  set  {(t,  j)\bij(f})  ^  0}.  Since  bij(fi)  =  0  iff  Cij  =  0  for  i  ^  j  and  &*,*(/?)  >  0  for  E  (0,  $j], 
the  structure  of  ergodic  classes  does  not  depend  on  fi .  Moreover,  if  nBp  =  7r  for  some  0  >  0,  then  7rC  =  0  and  thus  all 
invariant  distributions  are  independent  of  /3,  because  7rBp/  =  7rl  *4-  0'ttC  =  7rl  =  7r.  By  this  reason,  the  index  0  will  be 
frequently  omitted. 

2.2.  Perfectly  secure  cover  sources  under  mutually  independent  embedding  operation.  The  matrix  B  repre¬ 
sents  an  arbitrary  MI  embedding  with  k  ergodic  classes  £a  and  invariant  distributions  7r^a),  a  E  {1, . . . ,  k}.  The  following 
example  describes  a  construction  of  perfectly  secure  cover  sources  w.r.t.  B. 
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rbl  embedding:  |  =  1  —  0  Wi  —  §£  =  0 

F5:  ■  =  !-/?  m  =0  :8  =  i 


n  transient  state  Q  non-transient  state 

Figure  2.1.  Examples  of  several  embedding  methods  and  their  ergodic  classes. 

Example  2.  [Perfectly  secure  cover  sources]  Let  be  a  probability  distribution  on  2-element  cover  objects  defined  as 
pW(X?  =  (i,j))  =  7r^7rjfe)  for  some  a,  b  €  {1  Then  P ^  is  a  perfectly  secure  cover  source  w.r.t.  B  because 

kQ0(2)(Y\2  =  (i,j))  =  (]Hh\iP(Xl  =  i))  (£M  jP(X2  =5)) 

&  =  (7r(a)B)t(7r(6)B)j  =  ni(a)7rj(b)  =  P(2)(X12  =  (i,  j)), 

and  thus  both  distributions  P&\  and  Q  ?  are  identical,  which  implies  perfect  security.  Since  this  construction  does 
not  depend  on  the  particular  choice  of  a,  6  €  {1, . . . ,  &},  one  can  create  k 2  perfectly  secure  cover  sources  w.r.t.  B.  The 
probability  distributions  P^  obtained  from  this  construction  are  linearly  independent  and  form  a  fc2-dimensional  linear 
vector  space.  By  a  similar  construction,  one  can  construct  kn  n-element  linearly  independent  perfectly  secure  cover  sources 
w.r.t.  B. 

It  is  shown  next  that  there  are  no  other  linearly  independent  perfectly  secure  cover  sources  w.r.t.  B. 

Theorem  3.  [Mutually  independent  embedding]  There  are  exactly  kn  linearly  independent  perfectly  secure  probability 
distributions  P  on  n-element  covers.  Every  perfectly  secure  probability  distribution  P  w.r.t.  B  can  be  obtained  by  a  convex 
linear  combination  of  kn  linearly  independent  perfectly  secure  distributions  described  in  Example  2. 

Proof.  It  is  sufficient  to  prove  that  there  cannot  be  more  than  kn  linearly  independent  perfectly  secure  probability  distri¬ 
butions  P  on  n-element  covers.  The  proof  is  shown  for  n  =  2  and  then  its  generalization  is  discussed. 

Define  the  following  matrices  P  =  Pij  =  P(Xf  =  (t,  j)),  and  Q  =  (qij),  Qij  =  Qp(Yi  =  (t\  j)).  By  defininition 
of  MI  embedding, 

9n  =  £  Qa(Y?  =  =  (t;,  w))P(X 2  =  (v,  u0) 

=  ^  ^  bvibwjPvw 

v.w£X 
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Define  matrix  D  =  ( du2  v2 )  of  size  N2  x  N 2,  where  <^2^2  =  bUuVlbU2iV2.  If  p  is  defined  as  one  big  row  vector  of  elements 
pij  and  similarly  q,  then  assuming  perfect  security  of  cover  source  w.r.t.  B  (P  =  Q),  one  obtains  q  =  pB  =  p  and  thus  p 
is  a  left  eigenvector  of  D  corresponding  to  1.  Matrix  D  is  stochastic  and  thus  it  is  sufficient  to  show  that  it  has  k 2  ergodic 
classes. 

This  is  achieved  by  showing  first  that 

(2.1)  u\  ^  v2  (uy  ^  t/i)  and  (u2  ^  u2),  u\,v  1  £  A'2. 


By  u\  ^  v2  it  is  meant  that  v2  is  a  consequent  of  u2  of  order  m  in  terms  of  matrix  O.  If  u2  -l*  v2,  then  there  exist 
m-  1  intermediate  states  1  w2y . . .  ,m_i  w2,  such  that  dUilWdlWt2W  •  •  •  dm_lWt v  >  0.  Since  du7tV 2  =  bUl%VlbU2iV2 ,  this  implies 

the  existence  of  both  paths  tx,  ^  vt  of  order  m,  t  =  1, 2.  The  converse  is  true  by  the  same  reason. 

It  is  now  shown  that  £a  x  £5,  a, 6  £  are  the  only  ergodic  classes.  If  u\  (h!}  v\  and  u2  i/2,  then 

u\  mx^ni)  v2  for  au  ui,i»i  £  £a  and  u2,u2  £  £&,  because  the  path  from  u,  to  Vi  can  be  arbitrarily  extended  by  adding 

self  loops  of  type  j  -4  j  since  all  diagonal  terms  bjj  are  positive  and  thus  by  (2.1)  one  obtains  u2  ^rniifTl2^  ^2  Finally  by 
£  £a  and  u2,u2  £  £&,  u*  -4  m  and  by  the  same  argument  v2  -4  u2,  and  therefore  £a  x  £5  are  ergodic  classes.  Any 
other  state  JUJxfaUJx  J  must  be  transient  w.r.t.  B,  otherwise  by  (2.1)  a  contradiction  with  Ui  £  T  for 

some  i  is  obtained. 

This  proof  can  be  generalized  for  n  >  3  by  proper  definition  of  matrices  P,  Q,  and  B.  In  general,  matrix  B  has 
size  Nn  x  Nn.  By  similar  construction  one  obtains  kn  ergodic  classes  of  generalized  matrix  B.  However,  kn  linearly 
independent  distributions  are  already  known.  □ 


2.3.  Perfect  security  and  Fisher  information.  Here,  it  is  shown  that  for  stego-systems  with  MI  embedding  perfect 
security  can  be  captured  using  Fisher  information.  Ftom  Taylor  expansion  of  KL  divergence,  for  small  /?,  d((3)  =  ^2/(0)-f- 
0(83)  where  7(0)  =  d2d(f3)/df32\p=o  is  the  Fisher  information  w.r.t.  jS.  If  for  some  stego-system  d(fi)  =  0  for  /?  £  [0,  /?o]> 
then  7(0)  =  0  from  the  Taylor  expansion.  Even  though  the  opposite  does  not  hold  in  general,  the  PI  will  prove  that  for  MI 
embedding  zero  Fisher  information  implies  perfect  security.  In  other  words,  a  stego-system  with  MI  embedding  is  perfectly 
secure  for  fi  £  [0,  fio]  if  and  only  if  7(0)  =  0.  This  supplies  a  simpler  condition  for  verifying  perfect  security  than  the  KL 
divergence.  Fisher  information  also  provides  a  connection  to  quantitative  steganalysis  because  l/7(/?)  is  the  lower  bound 
on  variance  of  unbiased  estimators  of  /?.  Moreover,  7(0)  could  be  used  for  comparing  (benchmarking)  stego-systems. 
First,  he  condition  7(0)  =  0  is  reformulated: 

Proposition  4.  Let  P  and  Qp  be  probability  distributions  of  cover  and  stego  objects  with  n  elements  embedded  with 
parameter  (3.  The  Fisher  information  is  zero  if  and  only  if  the  Fl-condition  is  satisfied 

(2-2)  Vtf  6  Xn  (P(X?  =  )  >  0)  =*  '{±Qp(y*)\p_0  =  0). 


Proof.  The  second  derivative  of  d(fi)  at  /?,  d”(fi),  can  be  written  as 


where  Qp(Vi)  =  ^Q^(y?)-  By  P(y J1)  —  Q^o(y?),  the  first  term  in  the  bracket  in  (2.3)  sums  to  zero  at  =  0,  and  thus 
7(0)  is  zero  iff  Q'p(yi)\pm 0  =  0  is  zero  for  all  y?  £  X11  for  which  P(n)(y*)  >  0  as  was  to  be  proved.  Here,  it  is  assumed 
that  the  KL  divergence  d(/J)  be  continuous  w.r.t.  /?,  which  is  valid  by  the  construction  of  the  matrix  B.  □ 

The  next  theorem  shows  that  the  FI  condition  (2.2)  is  equivalent  with  perfect  security  for  MI  embedding. 

Theorem  5.  [Fisher  information  condition]  There  are  exactly  kn  linearly  independent  probability  distributions  P  on  n- 
element  covers  satisfying  the  FI  condition  (2.2).  These  distributions  are  perfectly  secure  w.r.t  B.  Every  other  probability 
distribution  P  satisfying  (2.2)  can  be  obtained  by  a  convex  linear  combination  of  kn  linearly  independent  perfectly  secure 
distributions. 


Proof.  From  Example  2,  kn  linearly  independent  perfectly  secure  distributions  are  available.  By  Taylor  expansion  of  d(i 3), 
these  distributions  satisfy  the  FI  condition,  because  <f(/3)  =  0  =>  7(0)  =  0.  It  is  sufficient  to  show  that  there  cannot  be 
more  linearly  independent  distributions  satisfying  the  FI  condition. 
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Similarly  as  in  the  previous  proof,  the  theorem  is  reformulated  as  an  eigenvector  problem  so  that  one  can  use  ergodic 
class  theory  to  give  the  exact  number  of  left  eigenvectors  corresponding  to  1.  Again,  the  proof  is  first  carried  out  for  the 
case,  n  =  2. 

If  P  satisfies  (2.2),  then  the  linear  term  in  the  Taylor  expansion  of  Qp(y 2)  w.r.t.  is  zero.  By  the  independence 
property,  (Q(t/? |xj)  =  n?=i  Q(y*|a:i)),  and  the  form  of  matrix  B  (B#  =  I  +  /?C),  condition  (2.2)  has  the  following  form 


dQM)  1 

d@  U=o 


d0 


(2.4)  =  =  0- 

Define  matrix  P  =  (pxj)  as  pxj  =  P(X 2  =  (z,  j))  and  represent  it  as  a  row  vector  p.  If  one  defines  matrix  D  =  (qu2  va)  of 
size  N2  x  N2  as 


(2.5) 


Cuuvx  if  7^  and  U2  —  V2 
Cu2tV2  if  ui  =  ui  and  112^^2 
0  otherwise, 


and  a  diagonal  matrix  G  =  (gu*tVf)  of  size  AT2  x.N2  as  <7U2U2  =  —  Cu2,u2,  then  equation  (2.4)  can  be  written  in 

a  compact  form  as  pB  =  p G.  Both  matrices  D  and  G  are  non-negative  by  their  definitions.  Let  H  =  I  +  7(D  —  G). 
Using  7  =  (inax^^^  Puf,^)”1)  the  matrix  H  is  stochastic  and  pH  =  p  iff  pD  =  pG.  Thus,  (2.2)  is  equivalent  with  an 
eigenvalue  problem  for  matrix  H. 

First,  observe  that  for  i  j  Cij  >  0  iff  h(i,a),o\<0  >  0  for  all  a  €  X,  because  by  (2.5)  h^^^  =  7d(t,0),(j,a)  =  7<Hj  (the 
first  case  when  U2  =  V2).  Similarly,  for  i  ^  j  Cij  >  0  iff  A(a,t),(<i.j)  >  0  for  all  a  €  X  (the  second  case  when  u\  =  Ui).  This 
means  that  z  ->  j  iff  (z,a)  -+  (j,a)  w.r.t.  H  for  all  a  €  X  and  similarly  i  ->  j  iff  (a,z)  ->  (a,j)  w.r.t.  H  for  all  a  €  X. 
This  can  be  proved  by  using  the  previous  statement.  By  this  rule  used  for  a  given  u\  G  £a  x  £&,  one  obtains  u\  -4  v2 
and  v2  -4  u\  for  all  v2  £  £a  x  £5  and  thus  £a  x  £5  is  an  ergodic  class  w.r.t.  H.  Te  PI  now  shows  that  there  can  not  be 
more  ergodic  classes  and  thus  all  k2  of  them  are  found.  If  u\  £  F  x  £,  then  u\  has  to  be  transient  w.r.t.  H,  otherwise 
one  would  obtain  contradiction  with  tii  £  F.  This  is  because  the  only  consequents  of  order  1  are  of  type  (z,a)  -4  (j,a) 
or  (a,  z)  -4  (a,j),  therefore  if  u2  £  F  x  £,  one  chooses  u2  €  A  x  £,  such  that  i>i  -/*  ui  (u\  is  transient  and  thus  such  v\ 
must  exist).  State  u\  must  be  transient  otherwise  tz2  44  v2  implies  u\  44  vi,  which  results  in  contradiction  with  i'i  -f*  U\. 
Similarly  for  u2  €  £  x  F  U  F  x  F.  This  proof  can  be  generalized  for  n  >  3  by  assuming  larger  matrices  P,  D,  G,  and  H, 
obtaining  exactly  kn  linearly  independent  perfectly  secure  distributions  satisfying  the  FI  condition.  □ 

Next,  the  PI  discusses  the  structure  of  the  set  of  invariant  distributions  for  a  given  MI  embedding  and  shows  how  to  find 
ergodic  classes  from  matrix  B  in  practice.  By  Theorem  2.1  from  t17,  Chapter  V,  page  175],  this  can  be  done  by  inspecting 
the  matrix  limit  M  =  ( m^j )  =  limn_*oo  ~  According  to  this  theorem,  state  z  is  non-transient  iff  >  0  and  is 

transient  otherwise.  Two  non-transient  states  i,j  £  X  are  included  in  one  ergodic  class  if  mitJ  >  0.  All  rows  of  the  matrix 
M  corresponding  to  states  in  one  ergodic  class  £a  are  the  same  and  equal  to  the  invariant  distribution  of  this  class,  7r^ . 

This  section  is  closed  with  a  short  discussion  of  two  practical  embedding  algorithms.  For  the  F5  embedding  algo¬ 
rithm  [90],  the  set  of  states  X  =  {—1024, ...,1024}.  By  the  nature  of  the  embedding  changes  (flip  towards  0),  there  is 
only  one  ergodic  set  E\  =  {0}  and  F  =  X  \  {0}.  Thus,  there  is  only  one  invariant  distribution,  ixq  =  1  and  zero  otherwise. 
Obviously,  no  message  can  be  embedded  in  covers  with  this  singular  distribution. 

For  the  case  of  LSB  embedding  over  X  =  {0, . . . ,  255},  one  has  £a  =  {2a,  2a  +  1}  for  a  £  {0, . . . ,  127},  F  *  0  and 
7»2^  ==  TToa+i  =  2  and  zero  oth6™56  (LSB  embedding  cannot  be  detected  in  images  with  evened  out  histogram  bins). 
Thus,  sources  realized  as  a  sequence  of  mutually  independent  random  variables  with  such  a  distribution  are  the  only 
perfectly  secure  sources  w.r.t.  LSB  embedding.  Figure  2.1  shows  examples  of  matrices  B  and  ergodic  classes  of  several 
known  algorithms  with  MI  embedding  operation. 


2.4.  Application  to  Markov  cover  sources.  In  this  section,  the  results  obtained  so  far  are  reformulated  for  a  special 
type  of  cover  sources  that  can  be  modeled  as  first-order  stationary  Markov  Chains  (MC).  The  results  play  a  key  role  in 
proving  the  square  root  law  of  steganographic  capacity  of  imperfect  stego-systems  for  Markov  covers  [27,  54]. 

First,  for  stationary  cover  sources  Theorem  3  leads  to  this  immediate  corollary. 

Corollary  6.  There  are  exactly  k  (instead  of  kn)  linearly  indpendent  perfectly  secure  stationary  cover  sources.  These 
sources  are  i.i.d.  with  some  invariant  distribution  7 rQ,  a  €  1, . . . ,  k. 
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The  next  corollary  states  that  in  order  to  study  perfect  security  of  n-element  stationary  MC  covers,  it  is  enough  to 
study  only  2-element  covers. 

Corollary  7.  Let  Pf  Qp  be  first-order  stationary  MC  cover  distribution  and  its  corresponding  stego  distribution  after 
MI  embedding  with  parameter  ft.  For  a  given  n  >2,  an  n-element  stego-system  is  perfectly  secure  iff  the  corresponding 
stego-system  narrowed  to  2-element  cover  source  is  perfectly  secure  for  some  Pq  >  0: 

(2.6)  3/Jo  >  0,  Vy2  G  *2  P<2>(*?  =  y\)  =  Q%( X2  =  y\). 

Moreover ,  the  FI  condition  for  Markov  sources  simplifies  to 

(2-7)  Vy2  G  X2  (p<a>(X?  =  y2)  >  o)  =>  (^JV) |„0  *  o). 

Proof .  Because  invariant  distributions  do  not  depend  on  /?,  Equation  (2.6)  must  be  valid  for  all  &  >  0  once  it  holds  for 
some  Pq  (see  the  arguments  at  the  end  of  Sec.  2.1).  By  Corollary  6,  if  the  stego-system  is  perfectly  secure  (n  >  2),  then 
the  cover  source  is  i.i.d.  with  some  invariant  distribution  w.r.t.  MI  embedding  and  thus  (2.6)  and  (2.7)  hold.  On  the  other 
hand,  if  (2.6)  and  (2.7)  hold  for  n  =  2  and  stationary  cover  source,  then  this  cover  source  is  i.i.d.  with  one  of  k  invariant 
distributions.  This  completes  the  proof  since  2-element  marginal  is  sufficient  statistics  for  a  first-order  stationary  MC.  □ 

2.5.  Discussion.  Most  practical  stego-systems  for  digital  media  embed  messages  by  making  independent  changes  to 
individual  cover  elements.  If  the  embedding  operation  is  fixed,  one  may  inquire  in  which  cover  sources  the  embedding 
is  statistically  undetectable  in  Cachings  sense.  The  main  contribution  of  this  part  of  the  report  is  a  complete  geometric 
characterization  of  such  sources.  Using  the  theory  of  ergodic  classes,  it  was  shown  that  all  cover  sources  that  are  perfectly 
secure  with  respect  to  mutually  independent  embedding  form  a  vector  space  spanned  by  invariant  distributions  determined 
by  the  embedding  operation. 

Additionally,  it  was  shown  that  perfect  security  of  stegosystems  with  mutually  independent  embedding  is  completely 
captured  using  Fisher  information  formulated  in  Section  2.3  as  the  FI  condition.  This  result  not  only  provides  a  simpler  and 
equivalent  condition  for  perfect  security,  but  it  finds  further  applications  in  steganalysis.  For  example,  Fisher  information 
could  be  used  for  benchmarking  such  stego-systems,  a  direction  pursued  in  Section  2.3.  Moreover,  Fisher  information 
provides  fundamental  lower  bounds  on  the  variance  of  unbiased  estimators  of  the  change  rate,  which  connects  our  results 
to  problems  in  quantitative  steganalysis.  Finally,  the  FI  condition  plays  a  key  role  in  proving  the  square  root  law  of 
steganographic  capacity  of  imperfect  stego-systems  [27,  54]  (Section  1.2). 
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3.  Experimental  verification  of  the  square  root  law 

The  PI  conducted  extensive  experiments  to  validate  the  square  root  law  described  in  the  two  sections  above.  Payloads 
will  be  embedded,  using  a  number  of  different  embedding  methods  and  various  payload  lengths,  into  cover  images  of 
different  sizes.  Then  state-of-the-art  steganalysis  methods  will  be  applied  to  these  images  while  looking  for  a  square  root 
relationship.  The  investigation  is  first  directed  to  spatial-domain  embedding  and  then  to  DCT-domain  steganography  for 
which  there  are  some  additional  challenges. 

3.1.  Spatial  domain.  There  are  two  difficulties  to  overcome  in  testing  the  theoretical  result  of  Section  1.  The  first 
is  the  caveat  that  capacity  is  a  square  root  law  all  other  things  being  equal.  Other  literature  on  the  benchmarking  of 
steganalysis  [5,  7]  has  shown  that  there  are  cover  properties  other  than  size  -  local  variance,  saturation,  prior  image 
processing  operations  -  which  significantly  affect  the  detectability  of  payload,  and  it  is  not  possible  to  control  or  even 
determine  them  all.  Therefore  one  cannot  use  sets  of  differently-sized  covers  from  different  sources  to  estimate  how 
capacity  depends  on  size:  variations  in  the  other  properties  may  invalidate  the  results.  Neither  can  one  generate  small 
cover  images  by  downsampling  large  ones,  because  downsampled  images  have  a  higher  semantic  density  so,  usually,  higher 
local  variance.  The  solution  is  to  use  a  single  set  of  large  covers  and  repeatedly  crop  down  to  smaller  images.  In  an 
attempt  to  preserve  other  image  characteristics,  the  cropped  region  can  be  chosen  so  that  the  average  local  variance  (here 
measured  by  average  absolute  difference  between  neighbouring  pixels)  is  as  close  as  possible  to  that  of  the  whole  image. 
The  image  libraries  available  to  the  PI  are  not  large  enough  to  partition  them  into  disjoint  sets  for  cropping  to  different 
sizes,  so  one  may  observe  correlation  between  the  content  of  the  different-sized  cropped  images,  but  this  is  not  expected 
to  cause  significant  effects  in  the  experiments. 

The  second  difficulty  is  to  define  “capacity.”  One  can  set  a  level  of  detection  risk  which  the  steganographer  is  prepared 
to  accept,  but  (even  apart  from  the  fact  that  the  level  itself  will  be  arbitrary)  how-  to  measure  detectability?  As  discussed 
in  [52]  and  [68],  there  are  many  different  detection  metrics  found  in  the  literature.  For  these  experiments  the  following 
three  metrics  wfill  be  considered,  one  standard  and  one  very  recent: 

(1)  The  minimum  sum  of  false  positive  and  false  negative  errors  for  a  binary  classifier  Pe  =  \  min(PfV4  +  Pmd)  (for 
comparability  with  other  measures,  1  —  Pe  is  used); 

(2)  Directly  from  the  observed  cover  and  stego  distributions  of  steganalysis  features,  a  recently-developed  measure 
called  Maximum  Mean  Discrepancy  (MMD).  Its  key  features  are  described  in  Section  3.4. 

In  each  case,  higher  values  denote  lower  security. 

Before  testing  this  hypothesis  empirically,  the  PI  returns  briefly  to  the  definition  of  capacity.  It  is  not  quite  correct  to 
speak  of  capacity  as  a  bound  on  the  size  of  payload  because  it  is  not  payload  itself  wdiich  is  detected  by  steganalysis.  It 
is  the  changes  induced  by  embedding  which  are  detected,  and  capacity  is  more  properly  given  by  a  bound  on  permissable 
changes;  in  simple  embedding  schemes  where  the  changes  are  of  fixed  magnitude,  it  is  the  number  of  changes  one  should 
measure.  This  difference  is  important  because  of  the  existence  of  adaptive  source  codes  [30],  which  can  exploit  freedom  of 
choice  of  embedding  locations  to  reduce  the  number  of  changes  required. 

The  first  series  of  experiments  was  performed  on  never-compressed  cover  images.  A  set  of  3000  images  was  downloaded 
from  the  NRCS  website  [63]:  apparently  scanned  from  film  in  full  colour,  these  images  vary  slightly  in  size  around 
approximately  2100  x  1500  pixels.  The  images  ware  downsampled  to  a  larger  side  of  1024  pixels,  and  reduced  to  grayscale: 
the  same  set  of  images  has  been  used  by  a  number  of  steganalysis  researchers.  Nine  sets  each  of  3000  grayscale  cover 
images  ware  then  created  by  repeated  cropping,  selecting  the  crop  region  best  to  match  the  local  variance  of  the  original, 
to  sizes  100  x  75,  200  x  150,  . . .,  900  x  675. 

Random  payload  was  embedded  using  simple  LSB  replacement  (for  payload  smaller  than  maximum  a  random  selection 
of  embedding  locations  was  used).  Three  different  strategies  ware  selected  for  choosing  the  payload  size  according  to  cover 
size:  embedding  a  fixed-size  payload  in  all  cover  sets,  embedding  payload  proportional  to  the  square  root  of  the  number  of 
cover  pixels,  and  embedding  payload  proportional  to  the  number  of  cover  pixels.  For  each  option,  three  different  constants 
of  proportionality  ware  tested. 

The  method  in  [53]  gives  the  currently-known  best  steganalysis  of  LSB  replacement  in  never-compressed  images,  and  it 
was  applied  to  each  set  of  covers  and  stego  images.  The  accuracies  of  the  resulting  detector  for  payload,  as  measured  by  Pe 
and  MMD.  are  displayed  in  Fig.  3.1,  along  with  90%  confidence  intervals  obtained  using  a  simple  resampling  bootstrap. 
These  experiments  are  in  line  with  the  theoretical  predictions:  whichever  detectability  metric  is  used,  fixed-length  payload 
becomes  harder  to  detect  in  larger  covers,  fixed-proportion  payload  becomes  easier  to  detect,  and  payload  proportional 
to  square  root  of  cover  size  is  (approximately)  of  constant  detectability.  At  least  these  results  suggest  that  square  root 
capacity  is  much  more  plausible  than  proportionate  capacity. 
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Figure  3.1.  Detectability  (t/-axis:  1  —  Pe  and  MMD  on  a  log  scale)  as  a  function  of  cover  size  N  (x- 
axis)  and  payload  size.  90%  bootstrapped  confidence  Intervals  are  indicated.  Left,  fixed  payload  size. 
Middle,  payload  proportional  to  y/N.  Right,  proportional  to  N.  LSB  replacement  steganography  in 
never-compressed  cover  images,  detected  by  method  of  [47], 
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FIGURE  3.2.  Detectability  as  a  function  of  cover  size,  cf.  Fig.  3.1.  LSB  replacement  steganography  in 
previously  JPEG-compressed  digital  camera  images,  detected  by  method  of  [48]. 


The  same  experiments  were  repeated  with  a  set  of  1600  images  taken  by  a  Minolta  DiMAGE  A1  camera  in  raw  format 
at  a  resolution  of  2000  x  1500,  subsequently  converted  to  grayscale  and  subject  to  JPEG  compression  (quality  factor  80). 
The  images  were  cropped  to  16  different  sizes  between  100  x  75  and  full  size,  again  selecting  the  crop  region  to  match  the 
average  local  variance  of  the  original.  When  cover  images  have  been  previously  compressed,  different  detectors  for  LSB 
replacement  have  better  performance  than  that  in  [53],  so  the  PI  used  the  Triples  detector  of  [48]. 

Charts  analogous  to  those  in  Fig.  3.1,  for  the  compressed  cover  images  and  Triples  steganalysis,  are  displayed  in  Fig.  3.2 
and  one  can  draw  similar  conclusions:  secure  payload  is  certainly  not  constant,  nor  proportional  to  cover  size,  but  appears 
to  be  approximately  proportional  to  the  square  root  of  the  cover  size.  More  visible  in  this  second  set  of  experiments  are 
artefacts  in  the  charts  for  very  small  cover  sizes,  but  these  are  to  be  expected  if  the  theoretical  results  are  only  asymptotic 
for  large  covers. 

Finally,  the  PI  tested  an  alternative  method  of  spatial-domain  LSB  embedding  known  as  LSB  matching,  or  ±1  embed¬ 
ding.  It  does  not  have  the  structural  flaws  of  LSB  replacement,  and  seems  much  more  difficult  to  detect.  For  the  detector, 
the  method  known  as  the  adjacency  HCF  COM  found  in  [49]  was  used.  This  detector  is  still  quite  weak:  payloads  as 
small  as  those  in  the  previous  two  experiments  are  undetectable,  so  the  PI  had  to  increase  the  payload  sizes  considerably. 
As  a  result,  it  was  not  possible  to  fit  the  payloads  into  very  small  covers  (one  cannot  embed  more  than  1  bit  per  pixel 
using  LSBs).  The  same  3000  never-compressed  scanned  images  were  used  for  the  first  experiment,  cropped  down  to  ten 
sizes  between  360  x  270  and  900  x  675.  The  resulting  charts  are  displayed  in  Fig.  3.3. 

Observe  that  the  detector’s  performance  remains  very  low:  P&  not  much  below  0.5  (which  corresponds  to  a  random 
detector)  and  MMD  is  near  to  zero  (corresponding  to  identical  distributions  of  cover  and  stego  features)  and,  because  one 
is  digging  in  the  detector  noise,  the  bootstrap  confidence  intervals  are  wider.  However,  similar  features  are  still  apparent: 
falling  detectability  in  larger  covers  when  the  payload  is  fixed  and  rising  detectability  when  the  payload  is  proportional  to 
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FIGURE  3.3.  Detectability  as  a  function  of  cover  size,  cf.  Fig.  3.1.  LSB  matching  steganography  in 
never-compressed  scanned  images,  detected  by  method  of  [49]. 


Figure  3.4.  Capacity  (t/-axes,  determined  by  limit  on  Pe)  as  a  function  of  cover  size  (x-axis),  log-log 
scale,  with  best-fit  trend  lines.  Three  different  steganography /steganalysis  methods  displayed.  Left, 

LSB  replacement  in  never-compressed  images  detected  by  [53];  middle,  LSB  replacement  in  previously 
JPEG-compressed  images  detected  by  [48];  right,  LSB  matching  detected  by  [49". 

cover  size.  When  the  payload  is  proportional  to  the  square  root  of  the  cover  size,  the  detection  metrics  are  approximately 
constant,  although  there  is  a  suggestion  that  the  detectability  may  be  gradually  decreasing. 

To  investigate  more  precisely  how  capacity  depends  on  cover  size  the  PI  performed  additional  experiments:  fixing  on 
just  one  detection  metric  a  bound  was  set  on  the  risk  to  the  steganographer  (a  minimum  value  of  Pe)  and  determined  the 
largest  payload  for  which  the  detection  bound  can  be  met.  This  was  accomplished  by  embedding  100  different  payload 
sizes  in  each  of  the  cover  sets,  measuring  Pe  for  each  combination  and  using  linear  interpolation  to  estimate  Pe  for 
intermediate  payloads.  Denoting  cover  size  (pixels)  by  N  and  capacity  (payload  bits)  by  M ,  one  can  plot  M  against  N 
on  a  log-log  scale:  if  there  is  a  relationship  of  the  form  M  oc  Ne  then  the  points  should  fall  in  a  straight  line  with  slope  e. 

Fig.  3.4  displays  the  results  for  each  of  the  three  detectors  and  cover  sets  in  the  experiments,  with  two  different 
thresholds  for  Pe  (in  the  case  of  LSB  matching,  one  must  set  a  very  high  threshold  for  Pe  because  the  detector  is  so 
weak).  In  each  case,  a  straight  line  fit  is  determined  by  simple  linear  regression.  When  capacity  is  measured  in  this  way, 
it  does  indeed  appear  to  follow  a  relationship  M  oc  ATC,  with  values  of  e  very  close  to  0.5.  Even  the  line  corresponding  to 
PE  =r  0.45  with  the  LSB  matching  detector  would  have  slope  close  to  0.5  if  the  data  points  from  the  smallest  image  sets 
were  discounted.  Unfortunately  one  cannot  use  the  standard  least-squares  tests  for  whether  e  differs  significantly  from 
0.5,  because  the  data  points  are  not  independent  (they  arise  from  images  with  overlapping  content). 

3.2.  Experimental  Investigation:  JPEG  Steganography.  The  experiments  of  the  previous  section  were  repeated  for 
steganography  and  steganalysis  in  JPEG  images,  to  see  whether  the  square  root  law  still  holds.  An  improved  version  of  F5, 
the  so-called  no-shrinkage  F5  (nsF5)  [37]  was  used  as  an  example  of  a  leading  steganographic  method  in  the  JPEG  domain. 
The  nsF5  has  the  same  embedding  operation  but  uses  wet  paper  codes  [34]  to  remove  shrinkage.  The  syndrome-coding 
mechanism  in  nsF5  was  disabled  because  it  introduces  non-linearity  between  the  payload  and  the  number  of  embedding 
changes  [37]. 

Measuring  the  size  of  a  JPEG  image  is  not  as  simple  as  counting  pixels.  After  lossy  compression,  many  of  the 
DCT  coefficients  become  zero  and  do  not  convey  content:  these  coefficients  cannot  be  used  for  embedding.  Therefore 
one  should  define  the  steganographic  size  as  the  total  number  of  nonzero  DCT  coefficients  (abbreviated  nc).  This  is  a 
generally-accepted  measure,  although  some  authors  also  discount  DC  coefficients. 
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Figure  3.5.  Detectability  as  a  function  of  cover  size  (nonzero  DCT  coefficients).  No-shrinkage  F5 
steganography  with  matrix  embedding  disabled,  in  JPEG  covers,  detected  by  method  of  [67], 


The  PI  began  with  approximately  9200  never-compressed  images  of  different  sizes,  and  from  them  cropped  15  sets  of 
cover  images  each  with  a  specified  number  of  nonzero  coefficients,  2  •  104,  4  •  104,  — ,  30  •  104,  all  under  JPEG  compression 
with  quality  factor  80.  None  of  the  images  were  double-compressed.  As  in  the  spatial-domain  experiments,  cropping  was 
favored  over  scaling:  the  latter  produces  images  with  a  higher  number  of  nonzero  DCT  coefficients  on  higher  frequencies, 
so  statistics  of  DCT  coefficients  in  scaled  images  vary  substantially  with  cover  size.  Also  paralleling  the  experiments  in 
the  previous  section,  the  crop  region  was  chosen  to  preserve  some  other  characteristics  of  the  cover.  In  the  case  of  JPEG 
images,  an  attempt  was  made  to  preserve  the  proportion  of  nonzero  DCT  coefficients. 

In  each  set  of  covers,  a  random  message  was  embedded  using  the  nsF5  algorithm.  As  before,  the  strategies  for  choosing 
the  payload  were  to  embed  a  fixed  size  payload  into  all  cover  sets,  to  embed  payload  proportional  to  the  square  root  of 
the  number  of  nonzero  coefficients,  and  to  embed  payload  proportionally  to  the  number  of  nonzero  coefficients. 

The  combination  of  Support  Vector  Machine  (SVM)  classifiers  [16]  with  a  Gaussian  kernel  and  the  274-dimensional 
merged  feature  set  [67]  is  the  state  of  art  general  purpose  steganalytic  system  for  JPEG  images.  The  PI  measured 
detectability  using  SVMs  trained  specifically  to  each  combination  of  cover  and  payload  size:  for  each  such  combination, 
6000  images  were  selected  at  random  from  the  available  set  of  9200,  split  into  disjoint  sets  of  3500  for  training  and  2500 
for  testing.  In  the  training  stage,  the  3500  cover  images  and  3500  corresponding  stego  images  were  used;  similarly  in  the 
testing  stage,  the  2500  cover  images  and  2500  corresponding  stego  images  were  all  classified  by  the  SVM.  The  training 
and  testing  of  the  SVM  classifiers  was  repeated  100  times  with  different  random  selections  of  training  and  testing  sets, 
and  the  overall  1  —  Pe  metric  computed  for  the  resulting  binary  classifiers. 

Additionally,  the  MMD  between  the  “merged  feature  set”  vectors  in  cover  and  stego  images  was  computed.  Again,  6000 
images  were  selected  at  random,  this  time  partitioned  into  disjoint  sets  of  3000  covers  and  3000  stego  images  (disjoint 
sets  are  necessary  for  good  MMD  estimation,  see  Section3.4).  This  was  repeated  100  times  with  random  allocations  of 
cover  and  stego  images:  increasing  the  accuracy  of  the  estimate,  and  also  allowing  the  PI  to  estimate  rough  bootstrap 
confidence  intervals.  Prior  to  computing  MMD,  the  vectors  were  normalized  so  that  each  cover  feature  had  zero  mean 
and  unit  variance:  note  that,  although  the  MMD  kernel  7  parameter  (see  Section  3.4)  is  fixed  for  all  cover  sizes,  the 
normalization  parameters  are  determined  separately  for  each  set.  This  proved  necessary  because  great  variability  was 
observed  in  the  raw  feature  distributions,  as  the  cover  size  varied. 

The  results  of  the  experiment  (Figure  3.5)  confirm  the  theoretical  predictions,  and  are  similar  to  the  results  presented 
for  the  spatial  domain.  For  fixed  (respectively,  linear)  payload,  by  any  metric  the  detectability  increases  (resp.  decreases) 
with  the  cover  size,  and  for  payload  proportional  to  the  square  root  of  nc  the  detectability  is  approximately  constant, 
albeit  with  a  barely-visible  downwards  trend.  It  is  not  known  why  the  MMD  measure  shows  this  as  a  slightly  stronger 
effect  than  Pe- 

Following  the  previous  section,  the  next  experiment  was  to  find  payload  such  that  the  probability  of  error  Pe  matches  a 
certain  level.  The  search  for  the  payload  was  carried  under  the  reasonable  assumption  that  the  detectability  increases  with 
the  payload  size.  The  Pe  measure  at  each  given  payload  was  estimated  by  the  accuracy  of  the  classifier  (again,  a  SVM 
with  a  Gaussian  kernel  employing  “merged”  features)  targeted  to  a  given  combination  of  nc  and  payload.  The  training 
and  testing  conditions  were  the  same  as  in  the  previous  experiment.  Even  though  repeated  training  of  the  classifier  is 
very  time  consuming,  this  approach  was  favoured  because  it  provides  good  estimates  of  Pe- 
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Figure  3.6.  Capacity  (y-axes,  determined  by  limit  on  Pe)  as  a  function  of  cover  size  (x-axis),  log-log 
scale,  with  best-fit  trend  lines.  No-shrinkage  F5  in  JPEG  covers. 

Figure  3.6  shows  maximum  payload  M  plotted  against  nc  N  in  log-log  scale  for  Pe  =  0.1  and  Pe  =  0.25.  Payloads 
were  identified  within  1%  accuracy  of  the  desired  Pe  level.  In  both  cases,  the  graph  shows  a  close  accordance  with  a 
straight  line  and  the  slope  of  the  line  is  close  to  0.5.  This  shows  that  the  capacity  of  JPEG  images  for  nsF5  (without 
matrix  embedding)  grows  with  square  root  of  the  number  of  nonzero  DCT  coefficients. 

3.3.  Discussion.  The  purpose  of  this  work  was  to  verify  the  square  root  law  of  secure  steganographic  payload.  Using 
carefully-designed  experiments,  which  as  far  as  possible  isolate  the  effect  of  cover  size  from  other  cover  properties,  the 
square  root  law  was  tested  for  a  number  of  steganography  schemes,  using  contemporary  steganalysis  detectors.  Close 
adherence  to  the  law  was  observed. 

It  is  not  widely  known  that  the  secure  capacity  of  a  cover  is  proportional  only  to  the  square  root  of  its  size  (where 
size  should  be  measured  by  available  embedding  locations),  in  the  absence  of  perfect  steganography.  It  seems  to  be  of 
fundamental  importance  to  the  practice  of  steganography,  and  could  be  particularly  vital  for  the  design  of  steganographic 
file  systems,  where  the  user  might  expect  to  be  given  an  indication  of  secure  capacity. 

However,  when  interpreting  the  square  root  law  one  must  take  care  not  to  ignore  other  important  factors  which  con¬ 
tribute  to  capacity.  In  practice,  properties  of  cover  images  such  as  saturation,  local  variance,  and  prior  JPEG  compression 
or  image  processing  operations  have  been  shown  to  have  significant  effects  on  detectability  of  payload  [5].  One  cannot 
simply  conclude  that,  because  one  cover  is  twice  as  larger  as  another,  it  can  carry  y/2  times  the  payload  at  an  equivalent 
risk.  The  law  applies  other  all  things  being  equal  and,  as  the  difficulties  constructing  suitable  experiments  to  test  the  law 
illustrate,  rarely  are  cover  images  equal. 

It  should  also  be  emphasised  that  the  law  truly  applies  not  to  raw  payload  size  but  to  the  embedding  changes  caused. 
In  some  embedding  schemes  these  quantities  are  not  proportional.  For  example,  using  syndrome  coding  [30]  and  binary 
embedding  operations  it  is  possible  to  design  embedding  codes  for  which  the  number  of  embedding  changes  c  and  payload 
size  M  approaches  asymptotically  the  bound  c  >  Nh"1(M/N)1  where  h  is  the  binary  entropy  function.  The  consequence 
of  an  asymptotic  limit  c  =  0{y/N)  is  then  M  =  0(y/N log N).  It  would  appear  that,  fundamentally,  steganographic 
payload  capacity  is  of  order  y/N  log  N. 

One  could  argue  that,  because  of  the  square  root  law,  researchers  should  cease  to  report  payloads  measured  in  bits  per 
pixel,  bits  per  second,  bits  per  nonzero  coefficient,  etc:  the  correct  units  should  perhaps  be  bits  per  square  root  pixel  and 
so  on.  However,  such  a  change  would  still  not  allow  comparability  of  different  authors’  benchmarks,  because  of  the  other 
factors  affecting  detectability;  unless  different  authors  use  covers  from  the  same  source,  their  results  cannot  be  exactly 
comparable  in  any  case. 

3.4.  The  MMD  Measure.  Maximum  Mean  Discrepancy  (MMD)  [42]  is  a  recently-developed  measure  of  difference 
between  probability  distributions.  If  X  and  Y  are  random  variables  with  the  same  domain  X  then  their  MMD  is  defined 
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as 

(3.1)  max|E[/(X)j-E[/(r)]|, 

where  the  maximum  is  taken  over  all  mappings  /  :  X  R  from  a  unit  ball  T  in  a  Reproducing  Kernel  Hilbert  Space 
(RKHS).  Although  not  a  true  metric,  the  MMD  is  symmetric,  nonnegative,  and  zero  only  when  X  and  Y  have  the  same 
distribution. 

For  technical  reasons  it  is  simpler  to  use  the  square  of  the  MMD  measure  in  (3.1)  and  in  this  report  the  PI  always 
reports  squared  MMD  values.  Given  n  independent  observations  x  =  (xi,...,xn)  of  X  and  a  further  n  independent 
observations  y  =  (1/1, . . . ,  yn)  of  Y,  the  (squared)  MMD  may  be  estimated  by 

nfn*— ~1)  ^  *(*<•  **)  +  *(»>% )  “  2fc(x«’  Vi) 

where  k  is  a  bounded  universal  kernel  k  :  X  x  X  R  that  defines  the  dot  product  in  the  RKHS  [82].  The  variance  of 
the  estimator  decreases  as  l/y/n,  almost  independently  of  the  dimension  of  the  random  variables  [43],  and  can  also  be 
improved  by  bootstrapping.  MMD  has  been  used  for  comparing  security  of  stego- schemes  in  [68]. 

In  this  report,  MMD  is  measured  w.r.t.  a  Gaussian  kernel 

k(x,y)  =  exp(— 7||x  -  y||2), 

with  the  width  parameter  7  set  to  77~2,  where  77  is  the  median  of  the  I/2-distances  between  (normalized)  features  in  a 
pooled  set  of  all  cover  images.  This  choice  is  justified  in  [68].  Note  that,  for  direct  comparison  of  MMD  values  obtained 
from  experiments  on  different  cover  sets,  the  7  parameter  should  remain  fixed. 
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4.  Fisher  Information  Determines  Capacity  of  e-secure  Steganography 

As  explained  in  Section  1,  most  practical  stegosystems  for  digital  media  work  by  applying  a  mutually  independent 
embedding  operation  to  each  element  of  the  cover.  For  such  stegosystems,  the  PI  has  shown  in  Section  2  that  the  Fisher 
information  w.r.t.  the  change  rate  is  a  perfect  security  descriptor  equivalent  to  KL  divergence  between  cover  and  stego 
images.  Under  the  assumption  that  cover  elements  form  a  Markov  drain,  in  this  section  a  closed-form  expression  for 
the  Fisher  information  is  derived  and  it  is  also  shown  how  it  can  be  used  for  comparing  stegosystems  and  optimizing 
their  performance.  In  particular,  using  an  analytic  cover  model  fit  to  experimental  data  obtained  from  a  large  number  of 
natural  images,  the  PI  proves  that  the  ±1  embedding  operation  is  asymptotically  optimal  among  all  mutually  independent 
embedding  operations  that  modify  cover  elements  by  at  most  1. 

The  key  concept  in  essentially  all  communication  schemes  is  the  channel  capacity  defined  as  the  amount  of  information, 
or  largest  payload,  that  can  be  safely  transmitted  over  the  channel.  As  shown  in  Section  1,  for  imperfect  stegosystems, 
the  communication  rate  is  not  a  good  descriptor  of  the  channel  because  it  approaches  zero  with  increasing  n.  The  sender, 
however,  still  needs  to  know  what  level  of  risk  she  is  exposing  herself  to  when  sending  a  message  to  the  recipient.  It  is 
critical  for  the  sender  to  know  how  much  information  she  can  send  using  her  stegosystem  in  an  n-element  cover  while 
keeping  the  KL  divergence  between  cover  and  stego  objects  below  some  chosen  e.  The  SRL  informs  the  sender  that  the 
amount  of  information  that  she  can  hide  scales  as  Tyfn  [27],  with  r  constant. 

In  this  section,  the  proportionality  constant  r  from  the  SRL  is  proposed  as  a  more  refined  measure  of  steganographic 
capacity  of  imperfect  stegosystems.  By  the  form  of  the  law,  r,  for  which  the  PI  coins  the  term  the  root  rate,  essentially 
expresses  the  capacity  per  square  root  of  cover  size.  Under  the  assumption  that  covers  form  a  Markov  chain  and  embedding 
is  realized  by  applying  a  sequence  of  independent  embedding  operations  to  individual  cover  elements,  a  closed  form 
expression  is  derived  for  the  root  rate.  The  root  rate  depends  on  the  Fisher  information  rate  w.r.t.  the  the  change  rate, 
which  is  a  perfect  security  descriptor  equivalent  to  the  KL  divergence  between  distributions  of  cover  and  stego  objects 
(Section  2).  Expressing  the  Fisher  information  rate  analytically  as  a  quadratic  form  allows  one  to  evaluate,  compare, 
and  optimize  security  of  stegosystems.  To  this  end,  the  PI  derives  an  analytic  cover  model  from  a  large  database  of 
natural  images  represented  in  the  spatial  domain  and  shows  that  the  ±1  embedding  operation  is  asymptotically  optimal 
among  all  mutually  independent  embedding  operations  that  modify  cover  elements  by  at  most  1.  Finally,  using  the  Fisher 
information  rate,  the  PI  compares  security  of  several  practical  stegosystems,  including  LSB  embedding  and  ±1  embedding. 
The  findings  appear  to  be  consistent  with  results  previously  obtained  experimentally  using  steganalyzers  and  are  in  good 
agreement  with  the  experimental  study  of  Section  3. 

The  notation  and  definitions  carry  over  from  Sections  1  and  2. 


4.1.  Fisher  information  in  steganography.  The  PI  now  introduces  the  concept  of  root  rate  as  a  measure  of  capacity 
of  imperfect  stegosystems.  First,  the  relationship  between  steganographic  capacity  of  stegosystems  satisfying  Assumptions 
1-3  and  the  Fisher  information  w.r.t.  the  parameter  P 


(4.1) 


/n(0)  =  Ep 


( 


(n) 

£ 


2l 


is  explained.  Then  in  Section  4.2,  a  closed  form  expression  for  the  Fisher  information  is  derived  and  wTitten  in  terms  of 
the  expected  relative  payload  a  instead  of  parameter  p  as  this  form  is  more  informative  for  the  steganographer. 

Fisher  information  is  a  fundamental  quantity  that  frequently  appears  in  theoretical  steganography  and  in  general  in 
signal  detection  and  estimation.  For  example,  the  Cramer- Rao  lower  bound  states  that  the  reciprocal  of  Fisher  information, 
\/In{P),  is  the  lower  bound  on  the  variance  of  unbiased  estimators  of  P  (quantitative  steganalyzers).  Fisher  information  also 
appears  in  the  leading  term  of  Taylor  expansion  of  the  KL  divergence  dn{P)  =  Dkl(P^\\Q^)  —  /32/n(0)/(2  In  2)-f  0(P3)1 
where 


Dk^P^WQ^)  =  £  P(n)(*?)log2 


P<n>(x?) 


Zero  KL  divergence  implies  zero  Fisher  information.  Although  the  opposite  is  not  true  in  general,  it  holds  for  all  stegosys¬ 
tems  with  MI  embedding  and  arbitrary  cover  model.  For  such  stegosystems,  Fisher  information  7n(0)  represents  a  perfect 
security  descriptor  equivalent  to  the  KL  divergence.  Fisher  information  was  also  proposed  for  benchmarking  steganalyz¬ 
ers  [52]. 

The  relationship  between  the  Fisher  information  rate  and  steganographic  capacity  of  stegosystems  satisfying  Assump¬ 
tions  1-3  was  established  in  Section  2.  It  was  essentially  showm  that  such  stegosystems  are  subject  to  the  Square  Root  Law, 
which  means  that  payloads  that  grow  faster  than  y/n ,  i.e.,  lim^oo  P(n)n/y/n  =  oo,  can  be  detected  arbitrarily  accurately, 
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whereas  payloads  that  grow  slower  than  y/n,  i.e.,  0(n)n/y/n  <  I\  <  oo,  lead  to  ^-secure  stegosystems,  dn{P)  <  £-3  This 
result  means  that  the  payload  that  can  be  securely  transmitted  over  the  steganographic  channel  scales  as  r>Jn.  Conse¬ 
quently,  the  sequence  of  embedding  parameters  ^(n)  must  approach  zero  for  e-secure  systems  and  thus  the  communication 
rate  tends  to  zero.  Due  to  this  fact,  it  makes  sense  to  evaluate  steganographic  capacity  in  the  limit  of  0{n)  — >  0. 


4.2.  Root  rate.  The  problem  of  steganalysis  can  be  formulated  as  the  following  hypothesis  testing  problem 

H0  :  0  =  0 

(4.2)  Hi  :  0>O . 

It  is  shown  show  that  for  small  (and  known)  ft  and  large  n,  the  likelihood  ratio  test  with  test  statistic 


(4.3) 


^T^(X)  =  -J=ln  {Q%\X)/PM{X)), 


is  a  mean-shifted  Gauss-Gauss  problem.4  This  property,  usually  called  the  Local  Asymptotic  Normality  (LAN)  of  the 
detector,  allows  us  to  quantify  and  correctly  compare  security  of  embedding  algorithms  operating  on  the  same  MC  cover 
model  for  small  values  of  0, 

In  this  case,  the  detector  performance  can  be  completely  described  by  the  deflection  coefficient  d2,  which  parametrizes 
the  ROC  curve  as  it  binds  the  probability  of  detection,  P#,  as  a  function  of  the  false  alarm  probability,  Pfa> 

PD=Q{Q-1(PFA)-Vd?). 

Here,  Q(x)  =  1  —  $(x)  and  $(x)  is  the  cdf  of  a  standard  normal  variable  N( 0, 1).  Large  value  of  the  deflection  coefficient 
implies  better  detection  or  weaker  steganography. 

First,  the  PI  states  the  LAN  property  for  the  HMC  model  w.r.t.  the  embedding  parameter  0  and  then  extends  this 
result  with  respect  to  the  relative  payload  a. 

Theorem  8.  [LAN  of  the  LLRT]  Under  Assumptions  1~3,  the  likelihood  ratio  (4.3)  satisfies  the  local  asymptotic  normality 
(LAN),  i.e.,  under  both  hypotheses  and  for  values  of  0  up  to  order  02 

(4.4)  ^(rjn)/n  +  £2// 2)  A  N( 0, 0*1)  under  Ho 

(4.5) 


0n(T^n) /n  —  0*1/2)  —*■  N{0,P*I)  under  H,, 

where  I  is  the  Fisher  information  rate ,  I  =  limn_>oo  ~/n(0),  and  -^4  is  the  convergence  in  distribution .  The  detection 
performance  is  thus  completely  described  by  the  deflection  coefficient 

j2  (0n0*I/2  +  0n0*I/2)2  _o2r 
a  -  JjSj  -np  1. 


Proof.  Due  to  limited  space,  only  a  brief  outline  of  the  proof  is  provided.  The  Gaussianity  of  the  test  statistic  follows  from 
the  Central  Limit  Theorem  (CLT)  due  to  the  fact  that  the  test  statistic  is  close  to  being  i.i.d.  Formal  proof  of  this  uses 
exponential  forgetting  of  the  prediction  filter  [18,  Lemma  9]  and  follows  similar  steps  as  the  proof  of  the  CLT  for  Markov 
chains  [17].  The  mean  and  variance  of  the  likelihood  ratio  (4.3)  is  obtained  by  expanding  (4.3)  in  Taylor  series  w.r.t.  0 
and  realizing  that  the  leading  term  is  the  quadratic  term  containing  the  Fisher  information  rate.  □ 

The  conclusion  of  the  theorem  is  now  rephrased  in  terms  of  the  payload  rather  than  the  parameter  0 .  Matrix  embedding 
(syndrome  coding)  employed  by  the  stegosystem  may  introduce  a  non-linear  relationship  0  =  /(a)  between  both  quantities. 
In  general,  the  payload  embedded  at  each  cover  element  may  depend  on  its  state  i  E  X.  Thus,  the  expected  value  of  the 
relative  payload  that  can  be  embedded  in  each  cover  is  a(0)  =  Kiai(P)>  where  cti{0)  stands  for  the  number  of  bits 
that  can  be  embedded  into  state  i  6  X  and  7 r*  is  the  stationary  distribution  of  the  MC.  The  value  of  0  for  which  a  is 
maximal  will  be  denoted  as  0m ax 

Pm  ax  -  argmaxa(/J). 

P 

For  example,  for  ternary  ±1  embedding  0m ax  =  2/3  and  a i{0MAx)  =  log2  3,  while  for  binary  ±1  embedding  0m ax  =1/2 
and  oti{0MAX )  =  1-  The  matrix  C  is  the  same  for  both  embedding  methods.  The  only  formal  difference  is  the  range  of  the 

^Here,  it  is  assumed  that  there  exists  a  linear  relationship  between  $(n)  and  the  relative  payload  ot(n)  (e.g.,  the  stegosystem  does  not 

employ  matrix  embedding).  Indeed,  application  of  matrix  embedding  does  not  invalidate  our  arguments  as  a(n)  differs  from  £(n)  only  by  a 

multiplicative  factor  bounded  by  logn. 

4In  hypothesis  testing,  the  problem  of  testing  N(no,a2)  vs.  N(p\  ,cr2)  is  called  the  mean-shifted  Gauss-Gauss  problem  and  its  detection 
performance  is  completely  described  by  the  deflection  coefficient  <P  =  (po  —  Pi)2 /a2  [46,  Chapter  3]. 
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parameter  p.  It  is  also  worth  noting  that  unless  all  ati  are  the  same,  the  maximal  payload  will  depend  on  the  distribution 
of  individual  states  7 r*.  . 

To  simplify  our  arguments,  it  will  be  assumed  that  a  linear  relationship  between  P  and  a  exists  (e.g.,  matrix  embedding 
is  not  considered).  Therefore,  one  can  write 

(4.6)  0  =  f(a)  =  ^^-q, 

&MAX 

where  a  £  [0,  a  a*  ax]  and  a  max  =  &(Pmax)  denotes  the  average  number  of  bits  that  can  be  embedded  into  cover  element 
while  embedding  with  p  =  Pm  ax  (maximum  change  rate). 

From  (4.6),  the  deflection  coefficient  can  be  expressed  in  terms  of  the  relative  payload  a  by  substituting  P  =  /(a)  from 
(4.6)  into  Qp 


In  practice,  the  sender  can  control  statistical  detectability  by  bounding  d2  <  e  for  some  fixed  e,  obtaining  thus  an  upper 
bound  on  the  total  number  of  bits  (payload)  an  that  can  be  safely  embedded  (this  requires  rearranging  the  terms  in  (4.7)) 


(4.8) 


an  < 


&MAX 

Pmax 


In  analogy  to  the  communication  rate,  it  is  natural  to  define  the  root  rate 


(4.9) 


A  aMAX 
\fipMAX 


as  the  quantity  that  measures  steganographic  security  of  imperfect  stegosystems  in  bits  per  square  root  of  cover  size  per 
square  root  of  KL  divergence.  The  root  rate  can  be  used  for  comparing  stegosystems  with  a  MC  cover  model. 

In  the  next  theorem,  proved  in  the  appendix,  the  PI  establishes  the  existence  of  the  main  component  of  the  root  rate, 
the  Fisher  information  rate  /,  and  expresses  it  in  a  closed  form. 


Theorem  9.  [Fisher  information  rate]  Let  A  =  (atj)  define  the  MC  cover  model  and  B,  defined  by  matrix  C  =  (cij), 
capture  the  embedding  algorithm.  Then ,  the  normalized  Fisher  information  ln(0)/n  approaches  a  finite  limit  I  as  n  -»  00. 
This  limit  can  be  written  as  I  =  crFc,  where  c  is  obtained  by  arranging  C  into  a  column  vector  of  size  N 2  with  elements 
Cij.5  The  matrix  F  of  size  N2  x  N2  is  defined  only  in  terms  of  matrix  A  and  does  not  depend  on  the  embedding  algorithm . 
The  elements  of  matrix  F  are 


(4.10) 


=  \j  =  l]V{iJ,k)  -  U{i,  jf,  M), 


where  by  the  Iverson  notation  [j  =  l]  is  one  if  j  =  l  and  zero  otherwise  and 

'•,71  a“  > '  =«•  •<*/ 

U(i,  j,  fc,  l)  =  7 n  (aik  -  +  **  ~  * 

Moreover ,  |/n(0)/n  —  I\  <  C/n  for  some  constant  C.  This  constant  depends  only  on  the  elements  of  matrix  A  and  not  on 
the  embedding  algorithm.  The  quadratic  form  /(c)  =  crFc  is  semidefinite,  in  general. 


By  inspecting  the  proof  of  the  theorem,  the  matrix  F  can  be  seen  as  the  Fisher  information  rate  matrix  w.r.t.  the 
parameters  {bij  |1  <  <  N }.  It  describes  the  natural  sensitivity  of  the  cover  source  to  MI  embedding.  The  quadratic 

form  then  combines  these  sensitivities  with  coefficients  given  by  the  specific  embedding  method  and  allows  us  to  decompose 
the  intrinsic  detectability  caused  by  the  cover  source  from  the  detectability  caused  by  the  embedding  algorithm. 


Corollary  10.  For  the  special  case  when  the  MC  degenerates  to  an  i.i.d.  cover  source  with  distribution  P  =  it.  the  Fisher 
information  rate  simplifies  to 


=  E 


7Ti7Tk 

dj——cky 
7T, 


5The  order  of  elements  in  C  is  immaterial  as  far  as  the  same  ordering  is  used  for  pairs  (iyj)  and  (A:,/)  in  matrix  F. 
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4.3.  Maximizing  the  root  rate.  In  the  previous  section,  it  was  established  that  the  steganographic  capacity  of  imperfect 
stegosystems  should  be  measured  as  the  root  rate  (4.9)  defined  as  the  payload  per  square  root  of  the  cover  size  and  per 
square  root  of  KL  divergence.  The  most  important  component  of  the  root  rate  is  the  stegosystem’s  Fisher  information 
rate,  for  which  an  analytic  form  was  derived  in  Theorem  9.  The  steganographer  is  interested  in  designing  stegosystems 
(finding  C)  with  the  highest  possible  root  rate.  This  can  be  achieved  by  minimizing  the  Fisher  information  rate  or  by 
embedding  symbols  from  a  larger  alphabet,  i.e.,  increasing  the  ratio  &max/Pmax-  In  this  section,  two  general  strategies 
are  described  for  maximizing  the  root  rate  that  are  applicable  to  practical  stegosystems.  In  Section  4.6,  conclusions  are 
drawn  from  experiments  when  these  strategies  are  applied  to  real  cover  sources  formed  by  digital  images. 

Before  proceedings  with  further  arguments,  the  PI  points  out  that  the  highest  root  rate  is  obviously  obtained  when  the 
Fisher  information  rate  is  zero,  7  =  0.  This  can  happen  for  non-trivial  embedding  (C  ^  0)  in  certain  sources  because  the 
Fisher  information  rate  is  a  semidefinite  quadratic  form.  Such  stegosystems,  however,  would  be  perfectly  secure  and  thus 
by  Assumption  3  are  excluded  from  our  consideration.6 

The  number  of  bits,  at,  that  can  be  embedded  at  each  state  i  G  X  is  bounded  by  the  entropy  of  the  tth  row  of 
B  =  I  +  /3C,  77(Bt#).  Thus,  in  the  most  general  setting,  one  wishes  to  maximize  the  root  rate 

(B im(pMAx))  I 
Pm  AX  yfl 

w.r.t.  matrix  C.  The  nonlinear  objective  function  makes  the  analysis  rather  complicated  and  the  result  may  depend 
on  the  distribution  of  individual  states  7 r.  Moreover,  even  if  one  knew  the  optimal  solution,  care  needs  to  be  taken  in 
interpreting  such  results,  because  a  practical  algorithm  allowing  us  to  communicate  the  entropy  of  the  additive  noise  may 
not  be  available.  The  PI  is  aware  of  only  a  few  practical  embedding  algorithms  that  communicate  the  maximal  amount 
of  information  (LSB  embedding  with  binary  symbols  and  ±1  embedding  with  ternary  symbols).  In  practice,  stochastic 
modulation  [32]  can  be  used  in  some  cases  to  embed  information  by  adding  noise  with  a  specific  pmf  (matrix  C),  but  the 
specific  algorithms  described  in  [32]  are  suboptimal. 

In  the  rest  of  this  section,  two  different  approaches  are  presented  how  to  optimize  the  embedding  algorithm  under 
different  settings  that  are  practically  realizable. 


4.4.  Optimization  by  convex  combination  of  known  methods.  One  simple  and  practical  approach  to  optimize  the 
embedding  method  is  obtained  by  combining  existing  stegosystems  and  S&K  Suppose  the  sender  embeds  a  portion 
of  the  message  into  An  elements,  0  <  A  <  1,  using  and  use  the  remaining  (1  —  A)n  elements  to  embed  the  rest  of  the 
message  using  S<2L  If  both  sender  and  the  receiver  select  the  elements  pseudo-randomly  based  on  a  stego  key,  the  impact 
on  a  single  cover  element  follows  a  distribution  obtained  as  a  convex  combination  of  the  noise  pmfe  of  both  methods. 
Note  that  the  methods  are  allowed  to  embed  a  different  number  of  bits  per  cover  element  since  the  receiver  knows  which 
symbol  to  extract  from  each  part  of  the  stego  object.  Let  S®  represent  the  ith  embedding  method  with  matrix  C^,  or 
its  vector  representation  c^,  with  ratio  p M  =  ^or  1  ^  {1*2},  The  r0°t  rate  r(^)  the  method  obtained  by 

the  above  approach  (convex  embedding)  with  parameter  A  can  be  written  as 


r(  A)  = 


Apt1)  +  (1  -  A )PW 


(4.11) 


y/( Act1)  +  (1  -  A)c(2))TF(Ac(D  +  (1  -  A)c(2)) 
Apt1)  +  (1  -  A )p<2> 


v/AW  +  (1  -  A ) W  +  2A(1  -  A) JO*2)  ’ 

where  /t*)  is  the  Fisher  information  rate  of  and  /t1*2)  =  (c^J^Fc*2).  Here,  the  symmetry  of  F  was  used  to  write 

/( 1.2)  =  7(2,1). 


4.5.  Minimizing  the  Fisher  information  rate.  In  an  alternative  setup,  one  may  deal  wuth  the  problem  of  optimizing 
the  shape  of  the  additive  noise  pmf  under  the  assumption  that  the  number  of  bits,  at,  embedded  at  each  state  i  G  X  is 
constant.  For  example,  one  may  wish  to  determine  the  optimal  pmf  that  would  allow  communicating  1  bit  per  element 
(a*  =  1,  Vi  G  X)  by  changing  each  cover  element  by  at  most  1.  In  this  problem,  the  ratio  &max/PmaXi  as  well  as  the 
cover  model  (matrix  A),  are  fixed  and  known.  The  task  is  to  minimize  the  Fisher  information  rate  7. 

The  PI  now  formulates  the  optimization  problem  by  restricting  the  form  of  the  matrix  C  =  (Cij),  or  its  vector  repre¬ 
sentation  c  =  (ctj)  G  RN  xl,  to  the  following  linear  parametric  form 

(4.12)  c  =  Bu  +  e, 

6An  example  of  such  a  stegosystem  Is  LSB  embedding  in  i.i.d.  covers  with  7T2»  =  7r2i+i  for  all  t. 
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where  B  =  ( dij )  is  a  full-rank  real  matrix  of  size  N2  x  fc,  e  is  a  real  column  vector  of  size  N 2,  and  v  =  (vi, . . . ,  Vk)T  is 
a  fc-dimensional  column  vector.  It  will  be  assumed  that  v  £  V,  where  V  is  bounded  by  a  set  of  linear  inequalities7  and 
the  constraint  £ .  c»j  =  0  for  all  i  £  {1, . . . ,  N}.  In  other  words,  the  matrix  C  is  decomposed  into  k  real  parameters  Vi, 
i  £  {1, . . . ,  k}.  The  following  example  shows  one  such  representation  for  a  stegosystem  whose  embedding  changes  are  at 
most  1. 

Example  11.  [Tridiagonal  embedding]  Set  ca  =  -1,  and  c^+i  =  1  -  \  for  i  £  {2, . . . ,  N  -  1}  (and 

suitably  defined  at  the  boundaries).  This  allows  modeling  ±1  embedding,  LSB  embedding,  and  all  possible  MI  embedding 
methods  that  modify  every  element  by  at  most  1.  By  setting  ca  =  —1  for  all  z,  the  sender  constraints  her  choices  to 
stegosystems  that  embed  the  same  payload  into  every  state  i  £  X  for  all  $  >  0.  This  model  has  k  =  N  -  2  parameters 
and  the  set  V  is  formed  by  Vj  £  [0, 1],  j  £  {1, . . . ,  k}. 

The  sender’s  task  is  to  minimize  the  Fisher  information  rate  for  embedding  methods  given  by  (4.12).  The  function 
/( v)  =  (Du  4-  e)TF(Bu  -f  e)  can  attain  its  minimum  either  at  a  point  with  a  zero  gradient8  (a  critical  point)  or  on  the 
boundary  of  V.  The  PI  now  derives  a  set  of  linear  equations  for  the  set  of  all  possible  critical  points.  This  approach 
will  be  used  in  Section  4.G  to  prove  that  ternary  ±1  embedding  is  asymptotically  optimal  within  the  class  of  tridiagonal 
embedding  in  spatial  domain. 

For  the  chosen  parametrization,  the  gradient  w.r.t.  every  parameter  vj  can  be  expressed  as 

ri  r) 

=  t^-(Bv  +  e)rF(Bu  -f  e)  =  2(B.y)TF(Bt>  -f  e), 

where  B#;  is  the  jth  column  of  matrix  B.  Because  every  possible  candidate  vq  for  the  optimal  parameters  must  satisfy 
{d/dvj)I(v)\v=VQ  =  0  for  every  j  £  {1, . . . ,  A;},  all  critical  points  are  solutions  of  the  following  linear  system 

(4.13)  BrFPt>  =  — BrFe. 

If  this  system  has  a  unique  solution  vo  €  V,  then  vq  corresponds  to  matrix  C  achieving  the  global  minimum  of  the  Fisher 
information  rate,  which  corresponds  to  the  best  MI  embedding  method  w.r.t.  V  and  a  given  MC  cover  source. 


4.6.  Experiments.  In  the  previous  section,  two  strategies  for  maximizing  the  root  rate  for  practical  stegosystems  were 
outlined.  This  section  presents  specific  results  when  these  strategies  are  applied  to  stegosystems  operating  on  8-bit  gray¬ 
scale  images  represented  in  the  spatial  domain.  Although  images  are  two  dimensional  objects  with  spatial  dependencies  in 
both  directions,  they  arc  represented  in  a  row- wise  fashion  as  a  first-order  Markov  Chain  over  X  =  {0, . . . ,  255}.  The  MC 
model  represents  the  first  and  simplest  step  of  capturing  pixel  dependencies  while  still  retaining  the  important  advantage 
of  being  analytically  tractable.  Then,  a  parametric  model  is  adopted  for  the  transition  probability  matrix  of  this  Markov 
cover  source  and  it  is  shownto  be  a  good  fit  for  the  empirical  transition  probability  matrix  A  estimated  from  a  large 
number  of  natural  images.  This  analytic  model  is  used  to  evaluate  the  root  rate  (4.9)  of  several  stegosystems  obtained  by 
a  convex  combinations  of  known  methods.  Finally,  it  is  shown  that  the  optimal  embedding  algorithm  that  modifies  cover 
elements  by  at  most  1  is  very  close  to  ±1  embedding. 

In  principle,  in  practice  one  could  calculate  the  Fisher  information  rate  using  equation  (4.10)  with  an  empirical  matrix 
A  estimated  from  a  large  number  of  images.  However,  this  approach  may  give  misleading  results  because  (4.10)  is  quite 
sensitive  to  small  perturbations  of  dij  with  a  small  value  (observe  that  I  =  -foe  if  dij  =  0).  This  is  not  going  to  be  an  issue 
in  practice  since  rare  transitions  between  distant  states  are  probable  but  content  dependent,  which  makes  them  difficult 
to  be  utilized  for  steganalysis.  Because  small  values  of  a<j  can  not  be  accurately  estimated  in  practice,  the  matrix  A  is 
represented  with  the  following  parametric  model 


(4.14) 


_ _ Le-(i<-jfi/T)1' 

,J  “  Zie 


where  Zi  =  e~^x~i^T^  is  the  normalization  constant.  The  parameter  7  controls  the  shape  of  the  distribution, 

whereas  r  controls  its  “width.”  The  model  parameters  were  found  in  the  logarithmic  domain  using  the  least  square  fit 
between  (4.14)  and  its  empirical  estimate.  To  validate  this  model,  the  least  square  was  carried  out  fit  separately  for 
three  image  databases:  never  compressed  images  taken  by  several  digital  cameras9  (CAMRAW),  digital  scans10  (NRCS), 


7E.g.,  one  must  have  B  >  0. 

sNote  that  the  semidefiniteness  of  F  guarantees  that  the  extremum  must  be  a  minimum. 

^Expanded  version  of  CAMERA_RAW  database  from  [41]  with  4547  8-bit  images. 

10Contains  2375  raw  scans  of  negatives  coming  from  the  USDA  Natural  Resources  Conservation  Service  (http://photogallery.nrcs.usda. 
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Figure  4.1.  Left:  plot  of  the  empirical  matrix  A  estimated  from  CAMRAW  database  in  log  domain. 
Right:  comparison  of  the  128th  row  of  matrix  A  estimated  from  the  same  database  with  the  analytic 
model  (4.14). 


and  decompressed  JPEG  images11  (NRCS-JPEG).  Figure  4.1  shows  the  comparison  between  the  empirical  matrix  A 
estimated  from  the  CAMRAW  database  and  the  corresponding  fit.  Although  this  model  cannot  capture  some  important 
macroscopic  properties  of  natural  images,  such  as  pixel  saturations,  it  remains  analytically  tractable  and  is  valid  for  many 
natural  images. 

The  left  part  of  Figure  4.2  shows  the  root  rate  (4.11),  r(A),  for  a  convex  combination  of  LSB  and  ±1  embedding, 
A  €  [0, 1],  for  different  image  sources.  The  higher  the  root  rate  r(A),  the  better  the  stegosystem.  The  results  are  consistent 
with  the  thesis  that  dbl  embedding  is  less  detectable  than  LSB  embedding.  Similarly,  the  capacity  of  stegosystems  with 
covers  from  NRCS  (scans)  is  believed  to  be  higher  than  the  capacity  of  stegosystem  with  decompressed  JPEGs  or  images 
from  digital  cameras.  This  fact  is  in  agreement  with  the  obtained  result  for  all  values  of  the  convex  combination  of  LSB 
and  ±1  embedding  and  it  can  be  attributed  it  to  the  fact  that  scans  contain  a  higher  level  of  noise  that  masks  embedding 
changes.  In  contradiction  with  expectations,  decompressed  JPEGs  from  NRCS-JPEG  have  a  higher  root  rate  than  raw 
images  from  digital  cameras  (CAMRAW).  This  phenomenon  is  probably  caused  by  the  simplicity  of  the  MC  model,  which 
fails  to  capture  JPEG  artifacts  because  they  span  across  larger  distances  than  neighboring  pixels. 

The  methodology  described  in  Section  4.5  is  now  used  to  maximize  the  root  rate  with  respect  to  stegosystems  that 
modify  each  cover  element  by  at  most  1.  It  is  done  for  the  cover  model  fit  obtained  from  the  NRCS  database.  Assuming 
the  embedding  operation  is  binary,  it  can  embed  one  bit  per  cover  element.  Thus,  it  is  sufficient  to  find  the  MI  embedding 
that  attains  the  minimum  Fisher  information  rate.  The  PI  used  the  parametrization  from  Example  11  and  solved  the 
system  of  equations  (4.13).  This  system  has  only  one  solution  v  =  (vi, . . .  ,1^54)  €  V  =  [0,  l]254  and  thus  it  represents  MI 
embedding  with  the  minimum  Fisher  information  rate.  This  solution  is  shown  in  the  right  part  of  Figure  4.2  along  with 
the  representation  of  the  ±1  embedding  operation.  The  optimal  MI  embedding  differs  from  ±1  embedding  only  at  the 
boundary  of  the  dynamic  range.  This  is  due  to  the  finite  number  of  states  in  the  MC  model.  It  was  experimentally  verified 
that  the  relative  number  of  states  with  It/*  —  0.5|  >  S  tends  to  zero  for  a  range  of  S  >  0  as  N  00  for  fixed  parameters 
of  the  analytic  model.12  Thus,  the  boundary  effect  is  negligible  for  large  N.  This  suggests  that  the  loss  in  capacity  when 
using  ±1  embedding  algorithm  is  negligible  for  large  N  or,  in  other  words,  ±1  embedding  is  asymptotically  optimal. 

4.7.  Discussion.  In  sharp  contrast  with  the  well-established  fact  that  the  steganographic  capacity  of  perfectly  secure 
stegosystems  increases  linearly  with  the  number  of  cover  elements,  n,  the  square  root  law  states  that  steganographic 
capacity  of  a  quite  wide  class  of  imperfect  stegosystems  is  only  proportional  to  y/n.  The  communication  rate  of  imperfect 
stegosystems  is  thus  non-informative  because  it  tends  to  zero  with  n.  Instead,  an  appropriate  measure  of  capacity  is  the 
constant  of  proportionality  in  front  of  y/n ,  for  which  the  term  the  root  rate  is  coined  whose  unit  is  bit  per  square  root 
of  cover  size  per  square  root  of  KL  divergence.  The  root  rate  is  shown  to  be  inversely  proportional  to  the  square  root  of 
the  Fisher  information  rate  of  the  stegosystem.  Adopting  a  Markov  model  for  the  cover  source,  an  analytic  formula  for 
the  root  rate  can  be  derived  with  Fisher  information  rate  expressible  as  a  quadratic  form  defined  by  the  cover  transition 
probability  matrix  evaluated  at  a  vector  fully  determined  by  the  embedding  operation.  This  analytic  form  is  important 
as  it  enables  comparing  capacities  of  imperfect  stegosystems  as  well  as  optimizing  their  embedding  operation  (maximize 

11  Images  from  NRCS  database  compressed  with  JPEG  quality  factor  70. 

12The  same  is  likely  to  be  true  for  all  S  >  0. 
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FIGURE  4.2.  Left:  the  root  rate  r( A)  =  a\iAx/{pMAX'fi)  of  a  convex  combination  of  LSB  and  ±1 
embedding  for  different  image  sources.  Right:  optimal  parameters  v  =  (ui, . . .  ,1*254)  of  MI  embedding 
(4.12)  minimizing  the  Fisher  information  rate  while  modifying  cover  elements  by  at  most  1.  The  difference 
between  ±1  embedding  and  optimal  MI  embedding  is  due  to  boundary  effects  that  vanish  as  N  -4  00. 


the  root  rate).  By  fitting  a  parametric  model  through  the  empirical  transition  probability  matrix  for  neighboring  pixels  of 
real  images,  the  model  is  used  to  compute  and  compare  the  root  rate  of  known  steganographic  schemes  and  their  convex 
combinations.  In  agreement  with  results  previously  established  experimentally  using  blind  steganalyzers,  the  analysis 
indicates  that  ternary  ±1  embedding  is  more  secure  than  LSB  embedding  and  it  is  also  optimal  among  all  embedding 
methods  that  modify  pixels  by  at  most  1.  Furthermore,  by  analyzing  image  databases  of  raw  images  from  different  sources, 
it  has  been  established  that  the  root  rate  is  larger  for  images  with  higher  noise  level  as  is  to  be  expected.  Among  the 
surprising  results  of  our  effort,  the  PI  points  out  the  fact  that  the  root  rate  for  ±1  embedding  is  only  about  twice  larger 
than  for  LSB  embedding,  which  contrasts  with  the  fact  that  current  best  steganalyzers  for  LSB  embedding  are  markedly 
more  accurate  than  the  best  steganalyzers  of  ±1  embedding.  This  hints  at  the  existence  of  significantly  more  accurate 
detectors  of  ±1  embedding  that  are  yet  to  be  found. 


4.8.  Proofs.  Proof  of  Theorem  9:  Only  the  main  idea  of  the  proof  is  presented  in  this  report,  leaving  the  main  bulk 
of  technical  details  to  the  report  [19].  The  decomposition  of  the  sequence  J„(0)/n  to  a  quadratic  form  and  its  properties 
can  be  obtained  directly  from  the  definition  of  Fisher  information 

VwL- 


lrM  In  2 

Ln(0)  —  V/52' 

n  n  op 2 


-  -  £  §  W,Q]  ( t  L)  (fru) 


-Ckl 


The  derivatives  of  the  log-likelihood  are  evaluated  at  B  =  I  because  B(fi)  =  I  +  J3C  and  0  =  0.  By  using  Qp{yi)  = 
Ylxn€Xn  lx?)»  random  variable  i,  j,  k ,  Z)  does  not  depend  on  the  embedding  method.  This  is  because 

the  derivatives  are  evaluated  at  IB  =  I  and  thus  only  contain  the  elements  of  the  cover  source  transition  matrix  A.  The 
proof  of  the  convergence  of  —  ^Ep[g(Y{1,  i,j,  fc,  Z)]  to  f(i%j),(k,i)  an^  its  closed  form  is  more  involved  and  is  presented  in  the 
report  [19].  The  semidefinitness  of  the  quadratic  form  follows  from  semidefiniteness  of  the  Fisher  information  matrix  F. 
It  is  not  positively  definite  because  for  an  i.i.d.  cover  source  all  rows  of  matrix  F  coincide  and  are  thus  linearly  dependent. 
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5.  Constructing  Practical  Stegosystems  using  Syndrome-Trellis  Codes 

The  previous  sections  describe  the  achievements  in  fundamental  understanding  of  stcganographic  capacity  in  imperfect 
stegosystems.  This  section  describes  a  complete  practical  methodology  for  constructing  minimal-distortion  steganographic 
systems  in  practice.  By  this,  the  PI  means  how  to  embed  a  given  payload  while  introducing  the  smallest  possible  distortion. 
As  long  as  one  can  tie  the  distortion  to  statsitical  detectability,  this  approach  provides  (near)  optimal  constructions. 

The  idea  is  to  proceed  by  steps  from  an  easier  problem  to  more  complex  ones.  The  simplest  problem  is  to  optimize 
the  embedding  for  an  additive  binary  embedding  operation  defined  as  follows:  First,  assign  to  each  over  element  a  scalar 
expressing  the  distortion  of  an  embedding  change  done  by  replacing  the  cover  element  by  its  “other”  value  (here,  the 
assumption  is  that  the  embedding  operation  is  binary).  The  total  distortion  is  assumed  to  be  a  sum  of  per-elemcnt 
distortions.  Both  the  payload-limited  sender  (minimizing  the  total  distortion  while  embedding  a  fixed  payload)  and  the 
distortion-limited  sender  (maximizing  the  payload  while  introducing  a  fixed  total  distortion)  axe  considered  in  this  section. 
The  non-binary  case  can  be  decomposed  into  several  binary  cases  by  replacing  the  individual  bits  in  cover  elements  as 
described  in  one  of  the  papers  by  the  PI  [25,  26],  The  binary  case  can  be  resolved  using  a  novel  syndrome-coding  scheme 
based  on  dual  convolutional  codes  equipped  with  the  Viterbi  algorithm.  This  fast  and  very  versatile  solution  achieves 
state-of-the-art  results  in  steganographic  applications  while  having  linear  time  and  space  complexity  w.r.t.  the  number 
of  cover  elements.  The  PI  illustrates  the  power  of  the  constructions  for  various  relative  payloads  and  different  distortion 
profiles,  including  the  wet  paper  channel.  This  framework  substantially  improves  upon  current  coding  schemes  used  in 
steganography,  such  as  matrix  embedding  and  wet  paper  codes. 

There  exist  two  mainstream  approaches  to  steganography  in  empirical  covers,  such  as  digital  media  objects:  steganogra¬ 
phy  designed  to  preserve  a  chosen  cover  model  and  steganography  minimizing  a  heuristically-defined  embedding  distortion. 
The  strong  argument  for  the  former  strategy  is  that  provable  undetectability  can  be  achieved  w.r.t.  a  specific  model.  The 
disadvantage  is  that  an  adversary  can  usually  rather  easily  identify  statistical  quantities  that  go  beyond  the  chosen  model 
that  allow  reliable  detection  of  embedding  changes.  The  latter  strategy  is  more  pragmatic  -  it  abandons  modeling  the 
cover  source  and  instead  tells  the  steganographer  to  embed  payload  while  minimizing  a  distortion  function.  In  doing  so, 
it  gives  up  any  ambitions  for  perfect  security.  Although  this  may  seem  as  a  costly  sacrifice,  it  is  not,  as  empirical  covers 
have  been  argued  to  be  incognizable  [6],  which  prevents  model-preserving  approaches  from  being  perfectly  secure  as  well. 

While  the  relationship  between  distortion  and  steganographic  security  is  far  from  clear,  embedding  while  minimizing  a 
distortion  function  is  an  easier  problem  than  embedding  with  a  steganographic  constraint  (preserving  the  distribution  of 
covers).  It  is  also  more  flexible,  allowing  the  results  obtained  from  experiments  with  blind  steganalyzers  to  drive  the  design 
of  the  distortion  function.  In  fact,  today’s  least  detectable  steganographic  schemes  for  digital  images  [55,  93,  71,  65]  were 
designed  using  this  principle.  Moreover,  when  the  distortion  is  defined  as  a  norm  between  feature  vectors  extracted  from 
cover  and  stego  objects,  minimizing  distortion  becomes  tightly  connected  with  model  preservation  insofar  the  features  can 
be  considered  as  a  low-dimensional  model  of  covers.  This  line  of  reasoning  already  appeared  in  [56,  65]  and  was  further 
developed  in  [23]. 

With  the  exception  of  [23],  steganographers  work  with  additive  distortion  functions  obtained  as  a  sum  of  single-letter 
distortions.  A  well-known  example  is  matrix  embedding  where  the  sender  minimizes  the  total  number  of  embedding 
changes.  Near-optimal  coding  schemes  for  this  problem  appeared  in  [31,  21],  together  with  other  clever  constructions 
and  extensions  [98,  94,  96,  22,  97,  95].  When  the  single-letter  distortions  vary  across  the  cover  elements,  reflecting  thus 
different  costs  of  individual  embedding  changes,  current  coding  methods  are  highly  suboptimal  [55,  71]. 

5.1.  Distortion  function.  For  concreteness,  and  without  loss  of  generality,  x  will  be  called  an  image  and  x,  its  zth  pixel, 
even  though  other  interpretations  are  certainly  possible.  For  example,  may  represent  an  RGB  triple  in  a  color  image,  a 
quantized  DCT  coefficient  in  a  JPEG  file,  etc.  Let  x  =  (xi, . . .  ,xn)  E  X  =  {Z}n  be  an  n-pixel  cover  image  with  the  pixel 
dynamic  range  Z.  For  example,  Z  =  {0, . . . ,  255}  for  8-bit  grayscale  images. 

The  sender  communicates  a  message  to  the  receiver  by  introducing  modifications  to  the  cover  image  and  sending  a 
stego  image  y  =  (yi, . . .  ,yn)  ey  -1\  x  Z2  x  •••x2n,  where  Z*  C  Z  are  such  that  x*  E  Z The  embedding  operation 
is  called  binary  if  |Z*|  =  2,  or  ternary  if  |Z*|  =  3  for  every  pixel  i.  For  example,  ±1  embedding  (sometimes  called  LSB 
matching)  can  be  represented  by  Z»  =  {x*  —  l,Xi,Xi  +  1}  with  appropriate  modifications  at  the  boundary  of  the  dynamic 
range. 

The  impact  of  embedding  modifications  will  be  measured  using  a  distortion  function  D.  The  sender  will  strive  to  embed 
payload  while  minimizing  D.  This  section  is  limited  to  an  additive  D  in  the  form13 


13The  case  of  embedding  with  non-additive  distortion  functions  is  addressed  in  [23]  by  converting  it  to  a  sequence  of  embeddings  with  an 
additive  distortion. 
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n 

(5-1)  D(x,  y)  =  Y^,Pi(x>yi)> 

t=l 

wliere  p,  :  X  xZi  ->  [-K,  K ],  0  <  K  <  oc*  are  bounded  functions  expressing  the  cost  of  replacing  the  cover  pixel  xT-  with 
pi.  Note  that  pi  may  arbitrarily  depend  on  the  entire  cover  image  x*  allowing  thus  the  sender  to  incorporate  inter-pixel 
dependencies  [65].  The  fact  that  the  value  of  p*(x,  yi)  is  independent  of  changes  made  at  other  pixels  implies  that  the 
embedding  changes  do  not  interact. 

The  boundedness  of  D(x,  y)  is  not  limiting  the  sender  in  practice  since  the  case  when  a  particular  value  yi  is  forbidden 
(a  requirement  often  found  in  practical  steganographic  schemes  [37])  can  be  resolved  by  excluding  yi  from  2i.  In  practice* 
the  sets  2i,i€{l,...*n}*  may  depend  on  cover  pixels  and  thus  may  not  be  available  to  the  receiver.  To  handle  this  case* 
the  domain  of  pt  is  expanded  to  X  x  X  and  defined  p*(x,  y<)  =  oo  whenever  pi  ^2*. 

The  definition  of  the  distortion  function  is  intentionally  kept  rather  general.  In  particular*  it  is  not  required  that 
Pi(x,X{)  <  pi(x*yj)  for  all  pi  €  2*  to  allow  for  the  case  when  it  is  actually  beneficial  to  make  an  embedding  change  instead 
of  leaving  the  pixel  unchanged.  An  example  of  this  situation  appears  in  [23]. 

5.2.  Problem  formulation.  This  section  contains  a  formal  definition  of  the  problem  of  embedding  while  minimizing  a 
distortion  function.  The  performance  bounds  are  stated  and  some  numerical  quantities  are  defined  that  will  be  used  to 
compare  the  coding  methods  w.r.t.  each  other  and  to  the  bounds. 

It  is  assumed  that  the  sender  obtains  her  payload  in  the  form  of  a  pseudo-random  bit  stream,  such  as  by  compressing 
or  encrypting  the  original  message.  Moreover*  the  embedding  algorithm  associates  every  cover  image  x  with  a  pair  { y ,  7 r}* 
where  y  is  the  set  of  all  stego  images  into  which  x  can  be  modified  and  7 r  is  their  probability  distribution  characterizing 
the  sender’s  actions,  7r(y)  =  P(Y  =  yjx).  Since  the  choice  of  depends  on  the  cover  image*  all  concepts  derived 

from  these  quantities  necessarily  depend  on  x  as  well.  Thinking  of  x  as  a  constant  parameter  that  is  fixed  in  the  very 
beginning *  the  dependency  on  x  is  not  made  explicit.  For  this  reason*  one  can  simply  write  D( y)  =  D(x*  y). 

If  the  receiver  knew  x,  the  sender  could  send  up  to 

(5.2)  h{  tt)  =  -  Yl  *(y)  los  *<y) 

yey 

bits  on  average  while  introducing  the  average  distortion 

(5.3)  e^d]  =  *(y)D(y) 

ysy 

by  choosing  the  stego  image  according  to  7r.  By  the  GePfand-Pinsker  theorem  [39],  the  knowledge  of  x  does  not  give  any 
fundamental  advantage  to  the  receiver  and  the  same  performance  can  be  achieved  as  long  as  x  is  known  to  the  sender. 
Indeed,  none  of  the  practical  embedding  algorithms  introduced  in  this  report  requires  the  knowledge  of  x  or  D  for  reading 
the  message. 

The  task  of  embedding  while  minimizing  distortion  can  assume  two  forms: 

•  Payload-limited  sender  (PLS):  embed  a  fixed  average  payload  of  m  bits  while  minimizing  the  average  distortion, 

(5.4)  minimize  EK[D]  subject  to  H(tt)  =  m. 

7T 

•  Distortion-limited  sender  (DLS):  maximize  the  average  payload  while  introducing  a  fixed  average  distortion 

A, 

(5.5)  maximize  H(tt)  subject  to  En[D]  = 

7T 

The  problem  of  embedding  a  fixed-size  message  while  minimizing  the  total  distortion  D  (the  PLS)  is  more  commonly 
used  in  steganography  wThen  compared  to  the  DLS.  When  the  distortion  function  is  content-driven,  the  sender  may  choose 
to  maximize  the  payload  with  a  constraint  on  the  overall  distortion.  This  DLS  corresponds  to  a  more  intuitive  use  of 
steganography  since  images  with  different  level  of  noise  and  texture  can  carry  different  amount  of  hidden  payload  and  thus 
the  distortion  should  be  fixed  instead  of  the  payload  (as  long  as  the  distortion  corresponds  to  statistical  detectability). 
The  fact  that  the  payload  is  driven  by  the  image  content  is  essentially  a  case  of  the  batch-steganography  paradigm  [50]. 
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5.2.1.  Performance  bounds  and  comparison  metrics.  Both  embedding  problems  described  above  bear  relationship  to  the 
problem  of  source  coding  with  a  fidelity  criterion  as  described  by  Shannon  [76]  and  the  problem  of  source  coding  with 
side  information  available  at  the  transmitter,  the  so-called  Gerfand-Pinsker  problem  [39].  Problems  (5.4)  and  (5.5)  are 
dual  to  each  other,  meaning  that  the  optimal  distribution  for  the  first  problem  is,  for  some  value  of  Dt,  also  optimal  for 
the  second  one.  Following  the  maximum  entropy  principle  [13,  Th.  12.1.1],  the  optimal  solution  has  the  form  of  a  Gibbs 
distribution  (see  Appendix  A  in  [31]  for  derivation): 


(5.6) 


_  exp(-AI>(y))  («)  TT  exp(-A /*(y<))  A  tt_  /..  % 

^y)  =  — z[ A) -  11 — TJT\ — -11™’ 


»=1 


Zi(  A) 


t*I 


where  the  parameter  A  €  [0,  oo)  is  obtained  from  the  corresponding  constraints  (5.4)  or  (5.5)  by  solving  an  algebraic 
equation;14  Z( A)  =  Ylyey  exp(-AD(y)),  Z{( A)  =  £yi€I.  exp(— Api(y*))  are  the  corresponding  partition  functions.  Step 
(a)  follows  from  the  additivity  of  D,  which  also  leads  to  mutual  independence  of  individual  stego  pixels  y*  given  x. 

By  changing  each  pixel  i  with  probability  7T*  (5.6)  one  can  simulate  embedding  -with  optimal  7r.  This  is  important 
for  steganography  developers  who  can  test  the  security  of  a  scheme  that  uses  the  pair  { 0>,  7r}  using  blind  steganalysis 
without  having  to  implement  a  practical  embedding  algorithm.  The  simulator  of  optimal  embedding  can  also  be  used 
to  assess  the  increase  in  statistical  detectability  of  a  practical  (suboptimal)  algorithm  w.r.t.  to  the  optimal  one.  This 
separation  principle  simplifies  the  search  for  better  distortion  measures  since  only  the  most  promising  approaches  can  be 
implemented.  The  simulators  are  used  to  benchmark  different  coding  algorithms  by  comparing  the  security  of  practical 
schemes  using  blind  steganalysis. 

An  established  way  of  evaluating  coding  algorithms  in  steganography  is  to  compare  the  embedding  efficiency  e(a)  = 
an/En[D]  (in  bits  per  unit  distortion)  for  a  fixed  expected  relative  payload  a  =  m/n  with  the  upper  bound  derived 
from  (5.6).  When  the  number  of  changes  is  minimized,  e  is  the  average  number  of  bits  hidden  per  embedding  change.  For 
general  functions  p»,  the  interpretation  of  this  metric  becomes  less  clear.  A  different  and  more  easily  interpretable  metric 
is  to  compare  the  payload,  m,  of  an  embedding  algorithm  w.r.t.  the  payload,  ttimax?  of  the  optimal  DLS  for  a  fixed  De, 


(5.7) 

which  will  be  called  the  coding  loss. 


l(De)  = 


771  max  ~  rn 
fRMAX 


5.2.2.  Binary  embedding  operation.  For  binary  embedding  operations,  it  is  enough  to  consider  a  slightly  narrower  class  of 
distortion  functions  without  experiencing  any  loss  of  generality.  The  binary  case  is  important  as  the  embedding  method 
described  here  can  be  extended  to  non-binary  operations  [20]. 

For  binary  embedding  with  2,*  =  {x*,x*},  x*  ^  xi}  let  p™m  =  min{p*(x,Xt),p*(x,Xi)}  and  Qi  =  |p*(x,Xi)  —  pi(x,xt)|  >  0. 
Eq.  (5.1)  can  now  be  rewritten  as: 

(5-8)  D(x,  y)  =  £  pfn  +  £>■  IP""  <  M*.  W)l- 

t=rl  t=l 

Because  the  first  sum  does  not  depend  on  y,  when  minimizing  D  over  y  it  is  enough  to  consider  only  the  second  term.  It 
now  becomes  clear  that  embedding  in  cover  x  while  minimizing  (5.8)  is  equivalent  to  embedding  in  cover  z 


(5.9) 

while  minimizing 

(5.10) 


= 


_fXi  when  =  Pt(x,  X,) 

[xi  when  p’nm  =  pt(x,Xi). 


£(z-y)  =  ^^(z’y<)  -  T4  zi)> 


i=  1 


t=l 


with  non-negative  costs  p»(z,  Zi)  =  0  <  pt(z,  Zi)  =  Qi  for  all  i  (when  the  cover  pixel  z,  is  changed  to  Zi,  the  distortion  D 
always  increases).  Thus,  from  now  on  binary  embedding  operations  will  always  consider  distortion  functions  of  the  form: 


(5.11) 

with  Qi  >  0. 


i=l 


14A  simple  binary  search  will  do  the  job  because  both  H( 7r)  and  Ett[D)  are  monotone  w.r.t.  A. 
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FIGURE  5.1.  Lower  bound  on  the  average  per-pixel  distortion,  E^[D)/n,  as  a  function  of  relative  payload 
a  for  different  distortion  profiles. 


For  example,  F5  [90]  uses  the  distortion  function  (5.11)  with  Qi  =  1  (the  number  of  embedding  changes),  while  nsF5  [37] 
employs  wet  paper  codes,  where  Qi  €  {l,oo}.  In  some  embedding  algorithms  [33,  55,  71],  where  the  cover  is  preprocessed 
and  quantized  before  embedding,  Qi  is  proportional  to  the  quantization  error  at  pixel 

Additionally,  for  binary  embedding  operations  one  introduces  the  so-called  distortion  profile  g  if  Qi  =  g(i/n)  for  all  i, 
where  q  is  a  non-decreasing15  function  q  :  [0,  l]  ->  [0,  K\.  The  following  distortion  profiles  are  of  interest  in  steganography 
(this  is  not  an  exhaustive  list):  the  constant  profile ,  q(x)  =  1,  when  all  pixels  have  the  same  impact  on  detectability  when 
changed;  the  linear  profile,  g(x)  =  2x,  when  the  distortion  is  related  to  a  quantization  error  uniformly  distributed  on 
[—  Q/2,Q/2]  for  some  quantization  step  Q  >  0;  and  the  square  profile ,  q(x)  =  3x2,  which  can  be  encountered  when  the 
distortion  is  related  to  a  quantization  error  that  is  not  uniformly  distributed. 

The  profile  q  is  usually  normalized  so  that  Eir[D]!n  «=  KiQi/n  =  0.5  when  embedding  a  full  payload  m  =  n.  With 
this  convention,  Figure  5.1  displays  the  lower  bounds  on  the  average  per-pixel  distortion  for  three  distortion  profiles. 

In  practice,  some  cover  pixels  may  require  Xi  =  {x*}  and  thus  Qi  =  oo  (the  so-called  wet  pixels  [33,  35,  37])  to  prevent 
the  embedding  algorithm  from  modifying  them.  Since  such  pixels  are  essentially  constant,  in  this  case  one  should  measure 
the  relative  payload  a  with  respect  to  the  set  of  dry  pixels  {xt|£i  <  oo),  i.e.,  a  =  m/|{xt|£t  <  oo}|.  The  overall  channel  is 
called  the  wet  paper  channel  and  it  is  characterized  by  the  profile  g  of  dry  pixels  and  relative  wetness  r  =  |{xi|pi  =  oo}|/n. 
The  wet  paper  channel  is  often  required  when  working  with  images  in  the  JPEG  domain  [37]. 


5.3.  Syndrome  coding.  The  PLS  and  the  DLS  can  be  realized  in  practice  using  a  general  methodology  called  syndrome 
coding .  This  section  briefly  reviews  this  approach  and  its  history  while  Section  5.4  explains  the  main  contribution  -  the 
syndrome-trellis  codes. 

Let  us  first  assume  a  binary  version  of  both  embedding  problems.  Let  V  :  Xi  -4  {0, 1}  be  a  parity  function  shared 
between  the  sender  and  the  receiver  satisfying  V(xi)  ^  V(yi)  such  as  V{x)  =  x  mod  2.  The  sender  and  the  receiver  need  to 
implement  the  embedding  and  extraction  mappings  defined  as  Emb  :  X  x  {0,  l}m  ->  y  and  Ext :  y  -4  {0,  l}m  satisfying 

Ext(Emb(x,  m))  =  m  Vx  €  X,Vm  €  {0,  l}m, 

respectively.  In  particular,  the  knowledge  of  the  distortion  function  D  at  the  receiver  is  not  assumed  and  thus  the 
embedding  scheme  can  be  seen  as  being  universal  in  this  sense.  A  common  information-theoretic  strategy  for  solving  the 
PLS  problem  is  known  as  binning,  which  is  here  implemented  using  cosets  of  a  linear  code.  Such  a  construction,  better 
known  as  syndrome  coding,  is  capacity  achieving  for  the  PLS  problem  if  random  linear  codes  are  used. 


15By  reindexing  the  pixels,  one  can  indeed  assume  that  Q\  <  02  <  *  •  •  <  Qn  <  K. 


FINAL  REPORT  FOR  CONTRACT  FA 9550-08- 1-0084  ENTITLED  “TOWARDS  STATISTICALLY  UNDETECTABLE  STEGANOGRAPHY*  35 


In  syndrome  coding,  the  embedding  and  extraction  mappings  are  realized  using  a  binary  linear  code  C  of  length  n  and 
dimension  n  -  m: 

(5.12)  Emb(x,  m)  =  arg  min  D(x ,  y), 

P(y)€C(m) 

(5.13)  Ext(y)  =ETP(y), 

where  V(y)  =  (P(yi), . . . ,  V(yn)),  H  G  {0,  l}mxn  is  a  parity-check  matrix  of  the  code  C,  C(m)  =  {z  G  {0,  l}n|Ez  =  m}  is 
the  coset  corresponding  to  syndrome  m,  and  all  operations  are  in  binary  arithmetic. 

Unfortunately,  random  linear  codes  are  not  practical  due  to  the  exponential  complexity  of  the  optimal  binary  coset 
quantizer  (5.12),  which  is  the  most  challenging  part  of  the  problem.  STCs  form  a  rich  class  of  codes  for  which  the  quantizer 
can  be  solved  optimally  with  linear  time  and  space  complexity  w.r.t.  n. 

Since  the  DLS  is  a  dual  problem  to  the  PLS,  it  can  be  solved  by  (5.12)  and  (5.13)  once  an  appropriate  message  size  rn 
is  known.  This  can  be  obtained  in  practice  by  m  =  mMAx(l  —  /'),  where  77imax  =  H(n\)  is  the  maximal  average  payload 
obtained  from  the  optimal  distribution  (5.6)  achieving  average  distortion  D€  and  l'  is  an  experimentally-obtained  coding 
loss  one  expects  the  algorithm  will  achieve. 

5.3.1.  Prior  Art.  The  problem  of  minimizing  the  embedding  impact  in  steganography,  introduced  above  as  the  PLS 
problem,  has  been  already  conceptually  described  by  Crandall  [15]  in  his  essay  posted  on  the  steganography  mailing  list 
in  1998.  He  suggested  that  whenever  the  encoder  embeds  at  most  one  bit  per  pixel,  it  should  make  use  of  the  embedding 
impact  defined  for  every  pixel  and  minimize  its  total  sum: 

“Conceptually,  the  encoder  examines  an  area  of  the  image  and  weights  each  of  the  options  that  allow  it 
to  embed  the  desired  bits  in  that  area.  It  scores  each  option  for  how  conspicuous  it  is  and  chooses  the 
option  with  the  best  score.71 

Later,  Bierbrauer  [3,  4]  studied  a  special  case  of  this  problem  and  described  a  connection  between  codes  (not  necessarily 
linear)  and  the  problem  of  minimizing  the  number  of  changed  pixels  (the  constant  profile).  This  connection,  which 
has  become  known  as  matrix  embedding  (encoding),  was  made  famous  among  steganographers  by  Westfeld  [90]  who 
incorporated  it  in  his  F5  algorithm.  A  binary  Hamming  code  was  used  to  implement  the  syndrome-coding  scheme  for  the 
constant  profile.  Later  on,  different  authors  suggested  other  linear  codes,  such  as  Golay  [83],  BCH  [75],  random  codes  of 
small  dimension  [38],  and  non-linear  codes  based  on  the  idea  of  a  blockwise  direct  sum  [4].  Current  state-of-the-art  methods 
use  codes  based  on  Low  Density  Generator  Matrices  (LDGMs)  [31]  in  combination  with  the  ZZW  construction  [95].  The 
embedding  efficiency  of  these  codes  stays  rather  close  to  the  bound  for  arbitrarily  small  relative  payloads  [28]. 

The  versatile  syndrome-coding  approach  can  also  be  used  to  communicate  via  the  wet  paper  channel  using  the  so-called 
wet  paper  codes  [33].  Wet  paper  codes  minimizing  the  number  of  changed  dry  pixels  were  described  in  [34,  75,  97,  22]. 

Even  though  other  distortion  profiles,  such  as  the  linear  profile,  are  of  great  interest  to  steganography,  no  general 
solution  with  performance  close  to  the  bound  is  currently  known.  The  authors  of  [55]  approached  the  PLS  problem  by 
minimizing  the  distortion  on  a  block-by-block  basis  utilizing  a  Hamming  code  and  a  suboptimal  quantizer  implemented 
using  a  brute-force  search  that  allows  up  to  three  embedding  changes.  Such  an  approach,  however,  provides  highly 
suboptimal  performance  far  from  the  theoretical  bound  (see  Figure  5.8).  A  similar  approach  based  on  BCH  codes  and  a 
brute-force  quantizer  was  described  in  [71]  achieving  a  slightly  better  performance  than  Hamming  codes.  Neither  Hamming 
or  BCH  codes  can  be  used  to  deal  with  the  wet  paper  channel  without  significant  performance  loss.  To  the  best  of  PPs, 
no  solution  is  known  that  could  be  used  to  solve  the  PLS  problem  with  arbitrary  distortion  profile  containing  wet  pixels. 

5.4.  Syndrome- Trellis  Codes.  In  this  section,  the  focus  is  on  solving  the  binary  PLS  problem  with  distortion  func¬ 
tion  (5.10).  The  resulting  codes  are  called  the  syndrome-trellis  codes.  These  codes  will  serve  as  a  building  block  for 
non-binary  PLS  and  DLS  problems  as  described  in  [20]. 

The  construction  behind  STCs  is  not  new  from  an  information-theoretic  perspective,  since  the  STCs  are  convolutional 
codes  represented  in  a  dual  domain.  However,  STCs  are  very  interesting  for  practical  steganography  since  they  allow 
solving  both  embedding  problems  with  a  very  small  coding  loss  over  a  wide  range  of  distortion  profiles  even  with  wet 
pixels.  The  same  code  can  be  used  with  all  profiles  making  the  embedding  algorithm  practically  universal.  STCs  offer 
general  and  state-of-the-art  solution  for  both  embedding  problems  in  steganography.  Here,  the  PI  gives  the  description 
of  the  codes  along  with  their  graphical  representation,  the  syndrome  trellis.  Such  construction  is  prepared  for  the  Viterbi 
algorithm,  which  is  optimal  for  solving  (5.12).  Important  practical  guidelines  for  optimizing  the  codes  and  using  them 
for  the  wet  paper  channel  are  also  covered.  Finally,  the  PI  studies  the  performance  of  these  codes  by  extensive  numerical 
simulations  using  different  distortion  profiles  including  the  wet  paper  channel. 
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Figure  5.2.  Example  of  a  parity-check  matrix  H  formed  from  the  submatrix  H  (h  =  2,w  =  2)  and 
its  corresponding  syndrome  trellis.  The  last  h  —  1  submatrices  in  H  are  cropped  to  achieve  the  desired 
relative  payload  a.  The  syndrome  trellis  consists  of  repeating  blocks  of  w  4- 1  columns,  where  “po”  and 
“p»*\  i  >  0,  denote  the  starting  and  pruning  columns,  respectively.  The  column  labeled  l  G  {1,2,...} 
corresponds  to  the  Ith  column  in  the  parity-check  matrix  H. 


The  main  goal  is  to  develop  efficient  syndrome-coding  schemes  for  an  arbitrary  relative  payload  a  with  the  main  focus 
on  small  relative  payloads  (think  of  a  <  1/2  for  example).  In  steganography,  the  relative  payload  must  decrease  with 
increasing  size  of  the  cover  object  in  order  to  maintain  the  same  level  of  security,  which  is  a  consequence  of  the  square  root 
law  [27].  Moreover,  recent  results  from  steganalysis  in  both  spatial  [64]  and  DCT  domains  [57]  suggest  that  the  secure 
payload  for  digital  image  steganography  is  always  far  below  1/2.  Another  reason  for  targeting  smaller  payloads  is  the  fact 
that  as  a  1,  all  binary  embedding  algorithms  tend  to  introduce  changes  with  probability  1/2,  no  matter  how  optimal 
they  are.  Denoting  with  R  =  (n  —  m)/n  the  rate  of  the  linear  code  C,  then  a  -*  0  translates  to  R  =  1  —  a  — ►  1,  which  is 
characteristic  for  applications  of  syndrome  coding  in  steganography. 

5.4.1.  From  convolutional  codes  to  syndrome-trellis  codes .  Since  Shannon  [76]  introduced  the  problem  of  source  coding 
with  a  fidelity  criterion  in  1959,  convolutional  codes  were  probably  the  first  “practical”  codes  used  for  this  problem  [86]. 
This  is  because  the  gap  between  the  bound  on  the  expected  per-pixel  distortion  and  the  distortion  obtained  using  the 
optimal  encoding  algorithm  (the  Viterbi  algorithm)  decreases  exponentially  with  the  constraint  length  of  the  code  [86,  44]. 
The  complexity  of  the  Viterbi  algorithm  is  linear  in  the  block  length  of  the  code,  but  exponential  in  its  constraint  length 
(the  number  of  trellis  states  grows  exponentially  in  the  constraint  length). 

When  adapted  to  the  PLS  problem,  convolutional  codes  can  be  used  for  syndrome  coding  since  the  best  stego  image  in 
(5.12)  can  be  found  using  the  Viterbi  algorithm.  This  makes  convolutional  codes  (of  small  constraint  length)  suitable  for 
steganography  because  the  entire  cover  object  can  be  used  and  the  speed  can  be  traded  for  performance  by  adjusting  the 
constraint  length.  Note  that  the  receiver  does  not  need  to  know  D  since  only  the  Viterbi  algorithm  requires  this  knowledge. 
By  increasing  the  constraint  length,  one  can  achieve  the  average  per-pixel  distortion  that  is  arbitrarily  close  to  the  bounds 
and  thus  make  the  coding  loss  (5.7)  approach  zero.  Convolutional  codes  are  often  represented  with  shift-registers  (see 
Chapter  48  in  [59])  that  generate  the  codeword  from  a  set  of  information  bits.  In  channel  coding,  codes  of  rates  R  =  1/k 
for  k  =  2, 3, . . .  are  usually  considered  for  their  simple  implementation. 

Convolutional  codes  in  standard  trellis  representation  are  commonly  used  in  problems  that  are  dual  to  the  PLS  problem, 
such  as  the  distributed  source  coding.  The  main  drawback  of  convolutional  codes,  when  implemented  using  shift-registers, 
comes  from  our  requirement  of  small  relative  payloads  (code  rates  close  to  one)  which  is  specific  to  steganography.  A 
convolutional  code  of  rate  R  =  (k  —  l)/fc  requires  k  —  1  shift  registers  in  order  to  implement  a  scheme  for  a  =  1  /k. 
Here,  unfortunately,  the  complexity  of  the  Viterbi  algorithm  in  this  construction  grows  exponentially  with  k.  Instead  of 
using  puncturing  (see  Chapter  48  in  [59]),  which  is  often  used  to  construct  high-rate  convolutional  codes,  the  PI  prefers 
to  represent  the  convolutional  code  in  the  dual  domain  using  its  parity-check  matrix.  In  fact,  Sidorenko  and  Zyablov  [78] 
showed  that  optimal  decoding  of  convolutional  codes  (our  binary  quantizer)  with  rates  R  =  (k  —  l)/k  can  be  carried  out 
in  the  dual  domain  on  the  syndrome  trellis  with  a  much  lower  complexity  and  without  any  loss  of  performance.  This 
approach  is  more  efficient  as  a  — >  0  and  thus  is  choosen  for  the  construction. 

In  the  dual  domain,  a  code  of  length  n  is  represented  by  a  parity-check  matrix  instead  of  a  generator  matrix  as  is  more 
common  for  convolutional  codes.  Working  directly  in  the  dual  domain  allows  the  Viterbi  algorithm  to  exactly  implement 
the  coset  quantizer  required  for  the  embedding  function  (5.12).  The  message  can  be  extracted  in  a  straightforward  manner 
by  the  recipient  using  the  shared  parity-check  matrix. 

5.5.  Description  of  syndrome- trellis  codes.  Although  syndrome-trellis  codes  form  a  class  of  convolutional  codes  and 
thus  can  be  described  using  a  classical  approach  with  shift-registers,  it  is  advantageous  to  stay  in  the  dual  domain  and 
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_  Forward  part  of  the  Viterbi  algorithm  _ 

wght [0]  =  0 

wght [1 , . . . ,2~h-l]  *  infinity 
indx  =  indm  =  1 

for  i  *  la...,num  of  blocks  (submatrices  in  H)  { 
for  j  *  1, . . . ,w  {  //  for  each  column 

for  k  -  0,...,2~h-l  {  //for  each  state 

vO  *  wght[k]  +  x [indx] *r ho [indx] 
wl  «  wght[k  XOR  H_hat[j]]  +  ( 1-x [indx] ) *rho [indx] 
path  [indx] [k]  *  wl  <  vO  ?  1  :  0  //  C  notation 

newghtfk]  =  min(vO,  wl) 

> 

indx++ 

wght  =  newwght 

> 

//  prune  states 
for  j  =  0, . . . ,2~ (h-l)-l 
wght[j]  =  wght [2* j  +  message [indm]] 
wght [2~(h-l) f . . . f 2~h-l]  *  infinity 
i nHm++ 

> 


_  Backward  part  of  the  Viterbi  alg. 

1  embedding_cost  =  wght[0) 

2  state  =  0,  indx — ,  indm — 

3  for  i  =  num  of  blocks,... ,1  (step  -1)  { 

4  for  j  =  w,...,l  (step  -1)  { 

s  y [indx]  =  path [indx] [state] 

6  state  *  state  XOR  (y [indx] *H_hat [j] ) 

7  indx — ' 

8  > 

9  state  *  2*state  +  message [indm] 

10  indm — 

11  > 


_  Legend  _ 

INPUT:  x,  message,  H_hat 

x  =  (x[l]  , .  .  .  ,x[n]  )  cover  object 
message  *  (message [1] .... , message [m] ) 
H_hat[j]  *  j  th  column  in  int  notation 

OUTPUT:  y,  embedding_cost 

y  *  (y[l] , . .  .  ,y[n])  stego  object 


FIGURE  5.3.  Pseudocode  of  the  Viterbi  algorithm  modified  for  the  syndrome  trellis. 


describe  the  code  directly  by  its  parity-check  matrix.  The  parity-check  matrix  He{0,l}mXnofa  binary  syndrome-trellis 
code  of  length  n  and  codimension  m  is  obtained  by  placing  a  small  submatrix  H  of  size  h  x  w  along  the  main  diagonal 
as  in  Figure  5.2.  The  submatrices  H  are  placed  next  to  each  other  and  shifted  down  by  one  row  leading  to  a  sparse  and 
banded  H.  The  height  h  of  the  submatrix  (called  the  constraint  height)  is  a  design  parameter  that  affects  the  algorithm 
speed  and  efficiency  (typically,  6  <  h  <  15).  The  width  of  H  is  dictated  by  the  desired  ratio  of  m/n,  which  coincides 
with  the  relative  payload  a  =  m/n  when  no  wet  pixels  are  present.  If  m/n  equals  to  1/k  for  some  k  E  N,  select  w  =  k. 
For  general  ratios,  find  k  such  that  l/(fc-b  1)  <  m/n  <  l/k.  The  matrix  H  will  contain  a  mix  of  submatrices  of  width  k 
and  k  - b  1  so  that  the  final  matrix  H  is  of  size  m  x  n.  In  this  way,  one  can  create  a  parity-check  matrix  for  an  arbitrary 
message  and  code  size.  The  submatrix  H  acts  as  an  input  parameter  shared  between  the  sender  and  the  receiver  and  its 
choice  is  discussed  in  more  detail  in  Section  5.5.2.  For  the  sake  of  simplicity,  in  the  following  description  it  is  assumed 
that  m/n  =  l/w  and  thus  the  matrix  H  is  of  the  size  b  x  (b  •  tu)?  where  b  is  the  number  of  copies  of  H  in  H. 

Similar  to  convolutional  codes  and  their  trellis  representation,  every  codeword  of  an  STC  C  =  { z  €  {0,  l}n|Hz  =  0} 
can  be  represented  as  a  unique  path  through  a  graph  called  the  syndrome  trellis.  Moreover,  the  syndrome  trellis  is 
parametrized  by  m  and  thus  can  represent  members  of  arbitrary  coset  C(m)  =  {z6  {0,  l}n]Hz  =  m}.  An  example  of 
the  syndrome  trellis  is  shown  in  Figure  5.2.  More  formally,  the  syndrome  trellis  is  a  graph  consisting  of  b  blocks,  each 
containing  2h(w  +  1)  nodes  organized  in  a  grid  of  w  +  1  columns  and  2h  rows.  The  nodes  between  two  adjacent  columns 
form  a  bipartite  graph,  i.e.,  all  edges  only  connect  nodes  from  two  adjacent  columns.  Each  block  of  the  trellis  represents 
one  submatrix  M  used  to  obtain  the  parity-check  matrix  H.  The  nodes  in  every  column  are  called  states. 

Each  z  E  {0,  l}n  satisfying  Hz  =  m  is  represented  as  a  path  through  the  syndrome  trellis  wThich  represents  the  process 
of  calculating  the  syndrome  as  a  linear  combination  of  the  columns  of  H  with  weights  given  by  z.  Each  path  starts  in  the 
leftmost  all-zero  state  in  the  trellis  and  extends  to  the  right.  The  path  shows  the  step-by-step  calculation  of  the  (partial) 
syndrome  using  more  and  more  bits  of  z.  For  example,  the  first  two  edges  in  Figure  5.2,  that  connect  the  state  00  from 
column  po  with  states  11  and  00  in  the  next  column,  correspond  to  adding  (P(y\)  =  1)  or  not  adding  (“P(yi)  =  0)  the 
first  column  of  H  to  the  syndrome,  respectively.16  At  the  end  of  the  first  block,  all  paths  for  which  the  first  bit  of  the 
partial  syndrome  does  not  match  m\  are  terminated.  This  way,  one  obtains  a  new  column  of  the  trellis,  which  will  serve 
as  the  starting  column  of  the  next  block.  This  column  merely  illustrates  the  transition  of  the  trellis  from  representing 
the  partial  syndrome  (si, . . . ,  $h )  to  (S2, . . . ,  s/»+ 1).  This  operation  is  repeated  at  each  block  transition  in  the  matrix  H 
and  guarantees  that  2h  states  are  sufficient  to  represent  the  calculation  of  the  partial  syndrome  throughout  the  whole 
syndrome  trellis. 


16The  state  corresponds  to  the  partial  syndrome. 
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FIGURE  5.4.  Embedding  efficiency  of  300  random  syndrome-trellis  codes  satisfying  the  design  rules  for 
relative  payload  a  =  1/2  and  constraint  height  h  =  10.  All  codes  were  evaluated  by  the  Viterbi  algorithm 
with  a  random  cover  object  of  n  =  106  pixels  and  a  random  message  on  the  constant,  linear,  and  square 
profiles.  Codes  are  shown  in  the  order  determined  by  their  embedding  efficiency  evaluated  on  the  constant 
profile.  This  experiment  suggests  that  codes  good  for  the  constant  profile  are  good  for  other  profiles.  Codes 
designed  for  different  relative  payloads  have  a  similar  behavior. 

To  find  the  closest  stego  object,  weights  are  assigned  to  all  trellis  edges.  The  weights  of  the  edges  entering  the  column 
with  label  Z,  Z  €  {l,...,n},  in  the  syndrome  trellis  depend  on  the  Zth  bit  representation  of  the  original  cover  object  x, 
V(xi ).  If  V(xi)  =  0,  then  the  horizontal  edges  (corresponding  to  not  adding  the  Zth  column  of  0)  have  a  weight  of  0 
and  the  edges  corresponding  to  adding  the  Zth  column  of  0  have  a  weight  of  p/.  If  V{x{)  =  1,  the  roles  of  the  edges  are 
reversed.  Finally,  all  edges  connecting  the  individual  blocks  of  the  trellis  have  zero  weight. 

The  embedding  problem  (5.12)  for  binary  embedding  can  now  be  optimally  solved  by  the  Viterbi  algorithm  with  time 
and  space  complexity  0(2hn).  This  algorithm  consists  of  two  parts,  the  forward  and  the  backward  part.  The  forward  part 
of  the  algorithm  consists  of  n  +  b  steps.  Upon  finishing  the  ith  step,  one  knows  the  shortest  path  between  the  leftmost 
all-zero  state  and  every  state  in  the  tth  column  of  the  trellis.  Thus  in  the  final,  n  4-  6th  step,  one  discovers  the  shortest 
path  through  the  entire  trellis.  During  the  backward  part,  the  shortest  path  is  traced  back  and  the  parities  of  the  closest 
stego  object  V(y)  are  recovered  from  the  edge  labels.  The  Viterbi  algorithm  modified  for  the  syndrome  trellis  is  described 
in  Figure  5.3  using  a  pseudocode. 

5.5.1.  Implementation  details .  The  construction  of  STCs  is  not  constrained  to  having  to  repeat  the  same  submatrix  0 
along  the  diagonal.  Any  parity-check  matrix  0  containing  at  most  h  nonzero  entries  along  the  main  diagonal  will  have 
an  efficient  representation  by  its  syndrome  trellis  and  the  Viterbi  algorithm  will  have  the  same  complexity  0(2hn).  In 
practice,  the  trellis  is  built  on  the  fly  because  only  the  structure  of  the  submatrix  0  is  needed  (see  the  pseudocode  in 
Figure  5.3).  As  can  be  seen  from  the  last  two  columns  of  the  trellis  in  Figure  5.2,  the  connectivity  between  trellis  columns 
is  highly  regular  which  can  be  used  to  speed  up  the  implementation  by  “vectorizing”  the  calculations. 

In  the  forward  part  of  the  algorithm,  one  needs  to  store  one  bit  (the  label  of  the  incoming  edge)  to  be  able  to  reconstruct 
the  path  in  the  backward  run.  This  space  complexity  is  linear  and  should  not  cause  any  difficulty,  since  for  h  —  10,  n  =  106, 
the  total  of  2 10  *  106/8  bytes  («  122MB)  of  space  is  required.  If  less  space  is  available,  one  can  always  run  the  algorithm 
on  smaller  blocks,  say  n  =  104,  without  any  noticeable  performance  drop.  If  one  is  only  interested  in  the  total  distortion 
D{y)  and  not  the  stego  object  itself,  this  information  does  not  need  to  be  stored  at  all  and  only  the  forward  run  of  the 
Viterbi  algorithm  is  required. 

5.5.2.  Design  of  good  syndrome-trellis  codes.  A  natural  question  regarding  practical  applications  of  syndrome-trellis  codes 
is  how  to  optimize  the  structure  of  0  for  fixed  parameters  h  and  w  and  a  given  profile.  If  0  depended  on  the  distortion 
profile,  the  profile  would  have  to  be  somehow  communicated  to  the  receiver.  Fortunately,  this  is  not  the  case  and  a 
submatrix  0  optimized  for  one  profile  seems  to  be  good  for  other  profiles  as  well.  In  this  section,  the  PI  studies  these 
issues  experimentally  and  describes  a  practical  algorithm  for  obtaining  good  submatrices. 

Let  us  suppose  that  the  goal  is  to  design  a  submatrix  0  of  size  h  x  w  for  a  given  constraint  height  h  and  relative 
payload  a  =  l/w.  The  PI  was  unable  to  find  a  better  algorithm  than  an  exhaustive  search  guided  by  some  simple  design 
rules.  First,  H  should  not  have  identical  columns  because  the  syndrome  trellis  would  contain  two  or  more  different  paths 
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FIGURE  5.5.  Average  number  of  wet  pixels  out  of  n  =  106  that  need  to  be  changed  to  find  a  solution  to 
(5.12)  using  STCs  with  ft  =  11. 


with  exactly  the  same  weight,  which  would  lead  to  an  overall  decrease  in  performance.  By  running  an  exhaustive  search 
over  small  matrices,  the  PI  observed  that  the  best  submatrices  H  had  ones  in  the  first  and  last  rows.  For  example,  when 
ft  =  7  and  w  =  4,  more  than  97%  of  the  best  1000  codes  obtained  from  the  exhaustive  search  satisfied  this  rule.  Thus, 
the  search  for  good  matrices  was  limited  to  those  that  did  not  contain  identical  columns  and  with  all  bits  in  the  first 
and  last  rows  set  to  1  (the  remaining  bits  were  assigned  at  random).  In  practice,  one  can  randomly  generate  10  —  1000 
submatrices  satisfying  these  rules  and  estimate  their  performance  (embedding  efficiency)  experimentally  by  running  the 
Viterbi  algorithm  with  random  covers  and  messages.  For  a  reliable  estimate,  cover  objects  of  size  at  least  n  =  106  are 
required. 

To  investigate  the  stability  of  the  design  w.r.t.  to  the  profile,  the  following  experiment  w^as  conducted.  The  PI  fixed 
ft  =  10  and  w  =  2,  which  correspond  to  a  code  with  a  =  1/2.  The  code  design  procedure  was  simulated  by  randomly 
generating  300  submatrices  Hi, ... ,  H300  satisfying  the  above  design  rules.  The  goodness  of  the  code  was  evaluated  using 
the  embedding  efficiency  (e  =  m/Z)(x,y))  by  running  the  Viterbi  algorithm  on  a  random  cover  object  (of  size  n  =  106) 
and  with  a  random  message.  This  wras  repeated  independently  for  all  three  profiles  from  Section  5.2.2.  Figure  5.4  shows 
the  embedding  efficiency  after  ordering  all  300  codes  by  their  performance  on  the  constant  profile.  Because  the  codes  with 
a  high  embedding  efficiency  on  the  constant  profile  exhibit  high  efficiency  for  the  other  profiles,  the  code  design  appears 
stable  w.r.t.  the  profile  and  these  matrices  can  be  used  with  other  profiles  in  practice.  All  further  results  are  generated 
by  using  these  matrices. 

5.5.3.  Wet  paper  channel.  In  this  section,  the  PI  investigates  how  STCs  can  be  used  for  the  wet  paper  channel  described 
by  relative  wetness  r  =  |{i|pi  =  oo}|/n  with  a  given  distortion  profile  of  dry  pixels.  Although  the  STCs  can  be  directly 
applied  to  this  problem,  the  probability  of  not  being  able  to  embed  a  message  without  changing  any  wet  pixel  may  be 
positive  and  depends  on  the  number  of  w^et  pixels,  the  payload,  and  the  code.  The  goal  is  to  make  this  probability  very 
small  or  to  make  sure  that  the  number  of  wet  pixels  that  must  be  changed  is  small  (e.g.,  one  or  tw*o).  Two  different 
approaches  are  now  described  to  address  this  problem. 

Let  us  assume  that  the  wet  channel  is  iid  with  probability  of  a  pixel  being  wet  0  <  r  <  1.  This  assumption  is  plausible 
because  the  cover  pixels  can  be  permuted  using  a  stego  key  before  embedding.  For  the  wet  paper  channel,  the  relative 
payload  is  defined  w.r.t.  the  dry  pixels  as  a  =  m/\{i\pi  <  oo}|.  When  designing  the  code  for  the  wet  paper  channel 
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Figure  5.6.  Effect  of  relative  wetness  r  of  the  wet  paper  channel  with  a  constant  profile  on  the  embedding 
efficiency  of  STCs.  The  distortion  was  calculated  w.r.t.  the  changed  dry  pixels  only  and  a  =  m/(n  —  rn). 
Each  point  was  obtained  by  quantizing  a  random  vector  of  n  =  106  pixels. 


Figure  5.7.  Comparison  of  the  coding  loss  of  STCs  as  a  function  of  the  profile  exponent  d  for  different 
payloads  and  constraint  heights  of  STCs.  Each  point  was  obtained  by  quantizing  a  random  vector  of 
n  =  106  pixels. 


with  n-pixel  covers,  relative  wetness  r,  and  desired  relative  payload  a,  the  parity-check  matrix  H  has  to  be  of  the  size 
[(1  —  r)an]  x  n. 

The  random  permutation  makes  the  Viterbi  algorithm  less  likely  to  fail  to  embed  a  message  without  having  to  change 
some  wet  pixels.  The  probability  of  failure,  pw,  decreases  with  decreasing  a  and  r  and  it  also  depends  on  the  constraint 
height  h.  From  practical  experiments  with  n  =  106  cover  pixels,  r  =  0.8,  and  h  =  10,  the  PI  estimated  from  1000 
independent  runs  pw  =  0.24  for  a  =  1/2,  pw  =  0.009  for  a  =  1/4,  and  pw  =  0  for  a  =  1/10.  In  practice,  the  message 
size  m  can  be  used  as  a  seed  for  the  pseudo-random  number  generator.  If  the  embedding  process  fails,  embedding  m  —  1 
bits  leads  to  a  different  permutation  while  embedding  roughly  the  same  amount  of  message.  In  k  trials,  the  probability  of 
having  to  modify  a  wet  pixel  is  at  most  p* ,  which  can  be  made  arbitrarily  small. 

Alternatively,  the  sender  may  allow  a  small  number  of  wet  pixels  to  be  modified,  say  one  or  two,  without  affecting  the 
statistical  detectability  in  any  significant  manner.  Making  use  of  this  fact,  one  can  set  the  distortion  of  all  wet  cover  pixels 
to  Qi  =  C,  C  >  52ei<00  Qi  and  &  =  Qi  for  i  dry.  The  weight  c  of  the  best  path  through  the  syndrome  trellis  obtained 
by  the  Viterbi  algorithm  with  distortion  &  can  be  written  in  the  form  c  =  ncC  4*  d ,  where  nc  is  the  smallest  number  of 
wet  cover  pixels  that  had  to  be  changed  and  d  is  the  smallest  weight  of  the  path  over  the  pixels  that  are  allowed  to  be 
changed. 


Embedding  efficiency  e  Embedding  efficiency  e  Embedding  efficiency  e 
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Figure  5.8.  Embedding  efficiency  and  coding  loss  of  syndrome-trellis  codes  for  three  distortion  profiles. 
Each  point  was  obtained  by  running  the  Viterbi  algorithm  with  n  =  106  cover  pixels.  Hamming  [55]  and 
BCH  [93]  codes  were  applied  on  a  block-by-block  basis  on  cover  objects  with  n  =  105  pixels  with  a  brute- 
force  search  making  up  to  three  and  four  changes,  respectively.  The  line  connecting  a  pair  of  Hamming 
or  BCH  codes  represents  the  codes  obtained  by  their  block  direct  sum.  For  clarity,  the  PI  presents  the 
coding  loss  results  in  the  range  a  €  [0.5, 1]  only  for  constraint  height  h  =  10  of  the  syndrome-trellis  codes. 


FINAL  REPORT  FOR  CONTRACT  FA9550-08- 1-0084  ENTITLED  "TOWARDS  STATISTICALLY  UNDETECTABLE  STEGANOGRAPHY"  42 


o 


JCt 

2 


3 

CL 

JC 

CD 

3 

O 


T3 


& 

c 

.22 

o 

e 

CD 

c 


“O 

T3 

O 

-O 

e 

w 


5  3 


upper  bound  e  <  =  4.54 


.  I  t  I  f  I  t  !  f  f  ! 

i  |  s  HI  2  2  2  2  2  2 

A  * 


•  ft  =  12 

•  ft  =  10 
•ft  =  9 

♦  ft  =  8 
oft  =  7 

♦  ft  =  6 

"0  200  400  600  800  1J)00 

Code  length  n 


FIGURE  5.9.  Results  for  the  syndrome-trellis  codes  designed  for  relative  payload  a  =  1/2.  Left:  Average 
number  of  cover  pixels  (xlO6)  quantized  per  second  (throughput)  shown  for  different  constraint  heights 
and  two  different  implementations.  Right:  Average  embedding  efficiency  for  different  code  lengths  n  (the 
number  of  cover  pixels),  constraint  heights  ft,  and  a  constant  distortion  profile.  Codes  of  length  n  >  1000 
have  similar  performance  as  for  n  =  1000.  Each  point  was  obtained  as  an  average  over  1000  samples. 

Figure  5.5  shows  the  average  number  of  wet  pixels  out  of  n  =  106  required  to  be  changed  in  order  to  solve  (5.12)  for 
STCs  with  ft  =  11.  The  exact  value  of  Qi  is  irrelevant  in  this  experiment  as  long  as  it  is  finite.  This  experiment  suggests 
that  STCs  can  be  used  with  arbitrary  r  as  long  as  a  <  0.7.  As  can  be  seen  from  Figure  5.6,  increasing  the  amount  of  wet 
pixels  does  not  lead  to  any  noticeable  difference  in  embedding  efficiency  for  constant  profile.  Similar  behavior  has  been 
observed  for  other  profiles  and  holds  as  long  as  the  number  of  changed  wet  pixels  is  small. 

5.5.4.  Experimental  results.  The  PI  has  implemented  the  Viterbi  algorithm  in  C++  and  optimized  its  performance  by 
using  Streaming  SIMD  Extensions  instructions.  Based  on  the  distortion  profile,  the  algorithm  chooses  between  the  float 
and  1  byte  unsigned  integer  data  type  to  represent  the  weight  of  the  paths  in  the  trellis.  The  following  results  were 
obtained  using  an  Intel  Core2  X6800  2.93GHz  CPU  machine  utilizing  a  single  CPU  core. 

Using  the  search  described  in  Section  5.5.2,  the  PI  found  good  syndrome-trellis  codes  of  constraint  height  ft  €  {6, ... ,  12} 
for  relative  payloads  a  =  1/w,  w  E  {1,...,20}.  Some  of  these  codes  can  be  found  in  (26,  Table  1].  In  practice,  almost 
every  code  satisfying  the  design  rules  is  equally  good.  This  fact  can  also  be  seen  from  Figure  5.4,  where  300  random  codes 
are  evaluated  over  different  profiles. 

The  effect  of  the  profile  shape  on  the  coding  loss  for  g(x)  as  a  function  of  d  is  shown  in  Figure  5.7.  The  coding 
loss  increases  with  decreasing  relative  payload  a.  This  effect  can  be  compensated  by  using  a  larger  constraint  height  ft. 

Figure  5.8  shows  the  comparison  of  syndrome- trellis  codes  for  three  profiles  with  other  codes  which  are  known  for  a 
given  profile.  The  ZZW  family  [96  applies  only  to  the  constant  profile.  For  a  given  relative  payload  a  and  constraint 
height  ft,  the  same  submatrix  H  was  used  for  all  profiles.  This  demonstrates  the  versatility  of  the  proposed  construction, 
since  the  information  about  the  profile  does  not  need  to  be  shared,  or,  perhaps  more  importantly,  the  profile  does  not 
need  to  be  known  a  priori  for  a  good  performance. 

Figure  5.9  shows  the  average  throughput  (the  number  of  cover  pixels  n  quantized  per  second)  based  on  the  used 
data  type.  In  practice,  1-5  seconds  were  enough  to  process  a  cover  object  with  n  =  106  pixels.  The  same  figure  shows 
the  embedding  efficiency  obtained  from  very  short  codes  for  the  constant  profile.  This  result  indicates  that  the  average 
performance  of  syndrome-trellis  codes  quickly  approaches  its  maximum  w.r.t.  n.  This  is  again  an  advantage,  since  some 
applications  may  require  short  blocks. 

5.6.  Practical  Embedding  Constructions.  In  this  section,  the  PI  shows  some  applications  of  the  proposed  methodol¬ 
ogy  for  spatial  and  transform  domain  (JPEG)  steganography.  In  the  past,  most  embedding  schemes  were  constrained  by 
practical  ways  of  how  to  encode  the  message  so  that  the  receiver  can  read  it.  Problems  such  as  “shrinkage”  in  F5  [90,  37] 
or  in  MMx  [55]  arose  from  this  practical  constraint.  By  being  able  to  solve  the  PLS  and  DLS  problems  close  to  the  bound 
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Relative  payload  a  (bits  per  non-zero  AC  coefF.) 


Figure  5.10.  Comparison  of  methods  with  four  different  weight-assignment  strategies  S1-S4  and  nsF5 
as  described  in  Section  5.6.1  when  simulated  as  if  the  best  coding  scheme  was  available.  The  performance 
of  strategy  S4  when  practically  implemented  using  STCs  with  h  =  8  and  h  =  11  is  also  shown. 

for  an  arbitrary  additive  distortion  function,17  steganographers  now  have  much  more  freedom  in  designing  new  embedding 
algorithms.  They  only  need  to  select  the  distortion  function  and  then  apply  the  proposed  framework.  The  only  task  left 
to  the  steganographer  is  the  choice  of  the  distortion  function  D .  It  should  be  selected  so  that  it  correlates  with  statistical 
detectability.  Instead  of  delving  into  the  difficult  problem  of  how  to  select  the  best  D,  the  PI  now  provides  a  few  examples 
of  additive  distortion  measures  motivated  by  recent  developments  in  steganography  and  shows  their  performance  when 
blind  steganalysis  is  used. 

In  the  examples  below,  the  PI  tested  the  embedding  schemes  using  blind  feature-based  steganalysis  on  a  large  database 
of  images.  The  image  database  was  evenly  divided  into  a  training  and  a  testing  set  of  cover  and  stego  images,  respectively. 
A  soft-margin  support- vector  machine  was  trained  using  the  Gaussian  kernel.  The  kernel  width  and  the  penalty  parameter 
were  determined  using  five-fold  cross  validation  on  the  grid  (C,  7)  E  { (10* ,  2>~d)\k  E  {—3, . . . ,  4},  j  E  { -3, . . . ,  3}  } ,  where 
d  is  the  binary  logarithm  of  the  number  of  features.  The  results  are  reported  using  a  measure  frequently  used  in  steganalysis 
-  the  minimum  average  classification  error 

(5.14)  Pe  =  min(/VA  +  A4D(flFA))/2, 

where  Pp a  and  F*md  are  the  false-alarm  and  missed-detection  probabilities. 

5.6.1.  DCT  domain  steganography.  To  apply  the  proposed  framework,  one  first  needs  to  design  an  additive  distortion 
function  that  can  be  tested  by  simulating  the  embedding  as  if  the  best  codes  are  available.  Finally,  the  the  most  promising 
approach  is  implemented  using  STCs.  It  is  assumed  that  the  cover  is  a  grayscale  bitmap  image  which  is  JPEG  compressed 
to  obtain  the  cover  image.  Let  A  be  a  set  of  indices  corresponding  to  AC  DCT  coefficients  after  the  block-DCT  transform 
and  let  c*  be  the  ith  AC  coefficient  before  it  is  quantized  with  the  quantization  step  g,  for  i  E  A.  Let  X  represent  the 
set  of  all  vectors  containing  quantized  AC  DCT  coefficients  divided  by  their  corresponding  quantization  step.  In  ordinary 
JPEG  compression,  the  values  c*  are  quantized  to  x*  =  [< Ci/qi ]. 

[Proposed  distortion  functions] 

A  binary  embedding  operation  is  defined  as  Z*  =  {xt,x*}  by  =  x*  +  sign(ct/gt-  —  Xi),  where  sign(x)  is  1  if  x  >  0,  -1 
if  x  <  0  and  sign(0)  E  {-1,1}  uniformly  at  random.  In  simple  words,  x*  is  a  quantized  AC  DCT  coefficient  and  x<  is  the 
same  coefficient  when  quantized  in  the  opposite  direction.  Let  e*  =  |c,/gi  -  xt|  be  the  quantization  error  introduced  by 
JPEG  compression.  By  replacing  x,  with  x,  the  error  becomes  \Cifqi  -  x,  |  =  1  -  e*.  If  et  =  0.5,  then  the  direction  where 
Ci/qi  is  rounded  depends  on  the  implementation  of  the  JPEG  compressor  and  only  small  perturbation  of  the  original 
image  may  lead  to  different  results.  Let  V(x)  =  x  mod  2.  By  construction,  V  satisfies  the  property  of  a  parity  function, 
V(xi)  ^  V(xi).  The  distortion  function  is  assumed  to  be  in  the  form  D(xfy)  =  £^1  Qi  -  [%i  ^  y<],  where  n  =  |*4|. 

The  following  four  approaches  utilizing  values  of  e,  and  qt  were  considered.  All  methods  assign  Qi  =  00  when  Ci/qi  E 
(—0.5, 0.5)  and  differ  in  the  definition  of  the  remaining  values  Qi  as  follows: 


17The  additivity  constraint  can  be  relaxed  and  more  general  distortion  measures  can  be  used  with  the  PLS  and  DLS  problems  in  practice  [23]. 
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d3  =  Xi,j+ 2  —  Xi,j+3 

FIGURE  5.11.  Set  of  4- pixel  cliques  used  for  calculating  the  distortion  for  digital  images  represented  in 
the  spatial-domain.  The  final  distortion  pijiyij)  is  obtained  as  a  sum  of  terms  penalizing  the  change  in 
pixel  Xij  measured  w.r.t.  each  clique  containing  Xij. 

•  Si:  Qi  =  1  —  2 e*  iici/qt  £  (-0.5, 0.5)  (as  in  perturbed  quantization  [33]), 

•  S2:  Qi  =  qi(  1  -  2e»)  if  a/qi  &  (-0.5, 0.5)  (the  same  as  SI  but  Qi  is  weighted  by  the  quantization  step), 

•  S3:  Qi  =  1  if  Ci/qi  €  (—1,  —0.5]  U  [0.5, 1)  and  Qi  =  1  —  2e*  otherwise,  and 

•  S4:  Q{  =  qt  if  a/qi  E  (—1,  -0.5]  U  [0.5, 1)  and  Qi  =  qi(  1  —  2 e»)  otherwise  which  is  similar  weight  assignment  as 

proposed  in  [71]. 

To  see  the  importance  of  the  side-information  in  the  form  of  the  uncompressed  cover  image,  the  PI  also  includes  in  the 
tests  the  nsF5  [37]  algorithm,  wliich  can  be  represented  using  the  above  formalism  as  =  [c*/g»],  Xi  =  x»  —  sign(xi),  and 
Qi  =  oo  if  Xi  =  0  and  Qi  =  1  otherwise.  This  way,  one  always  has  |x*|  <  |x*|.  The  nsF5  embedding  minimizes  the  number 
of  changes  to  non-zero  AC  DCT  coefficients. 

[Steganalysis  setup  and  experimental  results] 

The  proposed  strategies  were  tested  on  a  database  of  6,  500  digital  camera  images  prepared  as  described  in  [58,  Sec. 
4.1]  so  that  their  smaller  size  was  512  pixels.  The  JPEG  quality  factor  75  was  used  for  compression.  The  steganalyzer 
employed  the  548-dimensional  CC-PEV  feature  set  [57].  Figure  5.10  shows  the  minimum  average  classification  error  Pe 
achieved  by  simulating  each  strategy  on  the  bound  using  the  PLS  formulation.  The  strategies  Si  and  S2,  which  assign 
zero  cost  to  coefficients  a/qi  =  0.5,  wfere  worse  than  the  nsF5  algorithm  that  does  not  use  any  side-information.  On  the 
other  hand,  strategy  S4,  which  also  utilizes  the  knowledge  about  the  quantization  step,  wras  the  best.  By  implementing 
this  strategy,  it  is  necessary  to  deal  with  a  wet  paper  channel  which  can  be  well  modeled  by  a  linear  profile  with  relative 
wetness  r  «  0.6  depending  on  the  image  content.  The  PI  implemented  strategy  S4  using  STCs,  where  wet  pixels  were 
handled  by  setting  Qi  =  C  for  a  sufficiently  large  C.  As  seen  from  the  results  using  STCs,  payloads  below  0.15  bits  per 
non-zero  AC  DCT  coefficient  were  undetectable  using  the  employed  steganalyzer. 

Note  that  the  strategy  utilized  only  the  information  obtainable  from  a  single  AC  DCT  coefficient.  In  reality,  Qi  will 
likely  depend  on  the  local  image  content,  quantization  errors,  and  quantization  steps.  The  PI  postpones  the  problem  of 
optimizing  D  w.r.t.  statistical  detectability  to  future  research. 

5.6.2.  Spatial  domain  steganography.  To  demonstrate  the  merit  of  the  STC-based  multi-layered  construction,  the  PI 
presents  a  practical  embedding  scheme  that  was  largely  motivated  by  [65]  and  [23].  Single  per-pixel  distortion  function 
piyj{yij)  should  assign  the  cost  of  changing  the  i,jth  pixel  x»j,  first,  from  its  neighborhood  and  then  also  based  on  the 
new  value  yij.  Changes  made  in  smooth  regions  often  tend  to  be  highly  detectable  by  blind  steganalysis  wrhich  should 
lead  to  high  distortion  values.  On  the  other  hand,  pixels  which  are  in  busy  and  hard-to-model  regions  can  be  changed 
more  often. 

[Proposed  distortion  functions] 

The  distortion  function  is  designed  based  on  a  model  built  from  a  set  of  all  straight  4-pixel  lines  in  4  different  orientations 
containing  the  i,  jth  pixel  called  cliques  (see  Figure  5.11).  Based  on  the  set  of  all  such  cliques,  one  decides  on  the  value 
PiyjiVij)-  Due  to  strong  inter-pixel  dependencies,  most  cliques  contain  very  similar  values  and  thus  differences  between 
neighboring  pixels  tend  to  be  very  close  to  zero.  It  has  been  experimentaly  observed  [65],  that  the  number  of  cliques  falls 
of  quickly  as  the  differences  get  larger.  From  this  point  of  view,  any  clique  with  small  differences  should  lead  to  a  larger 
distortion  because  there  are  more  samples  the  warden  can  use  for  training  her  steganalyzer. 
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Figure  5.12.  Comparison  of  LSB  matching  with  optimal  binary  and  ternary  coding  with  embedding 
algorithms  based  on  the  additive  distortion  measure  (5.15)  using  embedding  operations  of  three  different 
cardinalities. 

More  formaly,  let  x  €  {0, . . . ,  255}niXU2  be  an  n\  x  n 2  grayscale  cover  image,  n  =  nin2,  represented  in  the  spatial 
domain.  Define  the  co-occurrence  matrix  computed  from  horizontal  pixel  differences  jD^j(x)  =  Xtj+i  —  ,  t  =  1, . . . , nj , 

j  =  -  1: 

«i(na-3) 

where  {(D^jyD^j+1,D^j+2)(x)  =  (p,q,r )]  =  [(£>£(x)  =  +1  (x)  =  q)k(D^j+2{x)  =  r)).  Clearly,  ^,,r(x)  €  [0,1] 

is  the  normalized  count  of  neighboring  quadruples  of  pixels  {xij,Ztj+i,x,j+2,Xij4.s}  with  differences  Xij+i  —  xtj  =  p, 
^t,i+2  =  q ,  and  xt%j+ 3  —  x*j+ 2  =  r  in  the  entire  image.  The  superscript  arrow  denotes  the  fact  that  the 

differences  are  computed  by  subtracting  the  left  pixel  from  the  right  one.  The  matrices  A£q  r(x)}  A J  y  r(x),  and  A^qr(x) 
are  defined  similarly.  Let  be  an  image  obtained  from  x  by  replacing  the  (i,  j)th  pixel  with  value  yij.  Finally,  the 

PI  defines  the  distortion  measure  D(y)  =  YXh  PijiVij)  by 

(5*15)  PijiViyj)  =  ^2  WP«9«rl^p.flir(X)  ~ 

P,<7,r6{-255,...,255} 

*€{-♦, 

where  wp^r  =  1/(1  4-  %/p2  4-  q2  4-  r2)  are  heuristically  chosen  weights. 

[Steganalysis  setup  and  experimental  results] 

All  tests  were  carried  out  on  the  BOWS2  database  [2]  containing  approximately  10, 800  grayscale  images  with  a  fixed 
size  of  512  x  512  pixels  coining  from  rescaled  and  cropped  natural  images  of  various  sizes.  Steganalysis  was  implemented 
using  the  second-order  SPAM  feature  set  with  T  =  3  [64]. 

Figure  5.12  contains  the  comparison  of  embedding  algorithms  implementing  the  PLS  and  DLS  with  the  costs  (5.15). 
All  algorithms  are  contrasted  with  LSB  matching  simulated  on  the  binary  and  ternary  bounds.  To  compare  the  effect  of 
practical  codes,  the  PI  first  simulated  the  embedding  algorithm  as  if  the  best  codes  were  available  and  then  compared 
these  results  with  algorithms  implemented  using  STCs  with  h  =  10.  Both  types  of  senders  are  implemented  with  binary, 
ternary  (X*  =  {x*  —  1, . . . ,  Xi  4  1}),  and  pentary  (2»  =  {xt  -  2, . . . ,  x»  4-  2})  embedding  operations.  Before  embedding,  the 
binary  embedding  operation  was  initialized  to  2,*  =  {x^,  j/i}  with  y*  randomly  chosen  from  {x»  —  l,xt  4- 1}.  The  reported 
payload  for  the  DLS  with  a  fixed  D€  was  calculated  as  an  average  over  the  whole  database  after  embedding. 

The  relative  horizontal  distance  between  the  corresponding  dashed  and  solid  lines  in  Figure  5.12  is  bounded  by  the 
coding  loss.  Most  of  the  proposed  algorithms  are  undetectable  for  relative  payloads  a  <  0.2  bits  per  pixel  (bpp).  For 
payloads  a  <  0.5,  the  DLS  is  more  secure.  For  larger  payloads,  the  distortion  measure  seems  to  fail  to  capture  the  statistical 
detectability  correctly  and  thus  the  algorithms  are  more  detectable  than  when  implemented  in  the  payload-limited  regime. 
Finally,  the  results  suggest  that  larger  embedding  changes  are  useful  for  steganography  when  placed  adaptively. 
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5.7.  Discussion.  The  concept  of  embedding  in  steganography  that  minimizes  a  distortion  function  is  connected  to  many 
basic  principles  used  for  constructing  embedding  schemes  for  complex  cover  sources  today,  including  the  principle  of 
minimal-embedding-impact  [37],  approximate  model-preservation  [65],  or  the  Gibbs  construction  [23].  The  current  work 
describes  a  complete  practical  framework  for  constructing  steganographic  schemes  that  embed  by  minimizing  an  additive 
distortion  function.  Once  the  steganographer  specifies  the  form  of  the  distortion  function,  the  proposed  framework  provides 
all  essential  tools  for  constructing  practical  embedding  schemes  working  dose  to  their  theoretical  bounds.  The  methods 
are  not  limited  to  binary  embedding  operations  and  allow  the  embedder  to  choose  the  amplitude  of  embedding  changes 
dynamically  based  on  the  cover-image  content.  The  distortion  function  or  the  embedding  operation  do  not  need  to  be 
shared  with  the  recipient.  In  fact,  they  can  even  change  from  image  to  image.  The  framework  can  be  thought  of  as  an 
off-the-shelf  method  that  allows  practitioners  to  concentrate  on  the  problem  of  designing  the  distortion  measure  instead 
of  the  problem  of  how  to  construct  practical  embedding  schemes. 

The  merit  of  the  proposed  algorithms  is  demonstrated  experimentally  by  implementing  them  for  the  JPEG  and  spatial 
domains  and  showing  an  improvement  in  statistical  detectability  as  measured  by  state-of-the-art  blind  steganalyzcrs.  The 
PI  has  demonstrated  that  larger  embedding  changes  provide  a  significant  gain  in  security  when  placed  adaptively.  Finally, 
the  construction  is  not  limited  to  embedding  with  larger  amplitudes  but  can  be  used,  e.g.,  for  embedding  in  color  images, 
where  the  LSBs  of  all  three  colors  can  be  seen  as  3-bit  symbols  on  which  the  cost  functions  are  defined.  Applications 
outside  the  scope  of  digital  images  are  possible  as  long  as  the  costs  can  be  defined. 

The  embedding  using  an  additive  distortion  function  has  been  greatly  extended  to  essentially  arbitrary  distortion 
functions  as  described  in  [23]  and  in  the  next  section. 
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6.  Gibbs  Construction  in  Steganography 

This  section  describes  one  of  the  main  achievements  of  this  effort  -  a  general  construction  for  building  steganographic 
schemes  using  the  principle  of  minimum  embedding  distortion.  The  contribution  is  based  on  a  connection  between 
steganography  design  by  minimizing  embedding  distortion  and  statistical  physics.  The  unique  aspect  of  this  work  and  one 
that  distinguishes  it  from  prior  art  is  that  the  distortion  function  is  allowed  to  be  arbitrary,  which  permits  considering 
spatially-dependent  embedding  changes.  The  PI  provides  a  complete  theoretical  framework  and  describes  practical  tools, 
such  as  the  thermodynamic  integration  for  computing  the  rate-distortion  bound  and  the  Gibbs  sampler,  for  simulating 
the  impact  of  optimal  embedding  schemes  and  constructing  practical  algorithms.  The  proposed  framework  reduces  the 
design  of  secure  steganography  in  empirical  covers  to  the  problem  of  finding  local  potentials  for  the  distortion  function 
that  correlate  with  statistical  detectability  in  practice.  By  working  out  the  proposed  methodology  in  detail  for  a  specific 
choice  of  the  distortion  function,  the  PI  experimentally  validates  the  approach  and  discusses  various  options  available  to 
the  steganographer  in  practice. 

There  exist  two  general  and  widely  used  principles  for  designing  steganographic  methods  for  empirical  cover  objects, 
such  as  digital  images.  The  first  one  is  model-preserving  steganography  in  which  the  designer  adopts  a  model  of  the  cover 
source  and  then  designs  the  embedding  to  either  completely  or  approximately  preserve  the  model  [45,  69,  72,  74,  81]. 
This  way,  one  can  provide  mathematical  guarantee  that  the  embedding  is  perfectly  secure  (or  e-secure)  within  the  chosen 
model.  A  problem  is  that  empirical  cover  objects  are  notoriously  difficult  to  model  accurately,  and,  as  history  testifies, 
the  model  mismatch  can  be  exploited  by  an  attacker  to  construct  a  sensitive  detection  scheme.  Even  worse,  preserving 
an  oversimplified  model  could  introduce  a  security  weakness  [8,  56,  91].  An  obvious  remedy  is  to  use  more  complicated 
models  that  would  better  approximate  the  cover  source.  The  major  obstacle  here  is  that  most  current  model-preserving 
steganographic  constructions  are  specific  to  a  certain  model  and  do  not  adapt  easily  to  more  complex  models. 

The  second,  quite  pragmatic,  approach  avoids  modeling  the  cover  source  altogether  and,  instead,  minimizes  a  heuristically- 
defined  embedding  distortion  (impact).  Matrix  embedding  [15],  wet  paper  codes  [36],  and  minimal  embedding  distortion 
steganography  [26,  31,  33,  55,  71)  are  examples  of  this  philosophy.  Despite  its  heuristic  nature,  the  principle  of  minimum 
embedding  distortion  has  produced  the  most  secure  steganographic  methods  for  digital  media  known  today,  at  least  in 
terms  of  low  statistical  detectability  as  measured  using  blind  steganalyzers  [37,  55,  58,  71].  Most  of  these  schemes,  however, 
use  a  distortion  function  that  is  additive  -  the  total  distortion  is  a  sum  of  individual  pixel  distortions  computed  from  the 
cover  image .  Fundamentally,  such  a  distortion  function  cannot  capture  interactions  among  embedding  changes,  which 
leads  to  suboptimality  in  practice.  This  deficiency  affects  especially  adaptive  schemes  for  which  the  embedding  changes 
have  a  tendency  to  form  clusters  because  the  pixel  distortion  is  derived  from  local  content  or  some  content-dependent 
side-information.  For  example,  the  embedding  changes  might  follow  edges  or  be  concentrated  in  textured  regions. 

One  discovers  a  relationship  between  both  embedding  principles  when  the  distortion  function  is  defined  as  a  weighted 
norm  of  the  difference  between  feature  vectors  of  cover  and  stego  objects  in  some  properly  chosen  feature  space  [56,  66],  an 
example  of  which  are  spaces  utilized  by  blind  steganalyzers.  The  projection  onto  the  feature  space  is  essentially  equivalent 
to  modeling  the  objects  in  a  lowfer-dimensional  Euclidean  space.  Consequently,  minimizing  the  distortion  between  cover 
and  stego  objects  in  the  feature  space  now  becomes  closely  tied  to  model  preservation.  Yet  again,  in  this  case  the  distortion 
cannot  be  written  as  a  sum  of  individual  pixel  distortions  also  because  the  features  contain  higher-order  statistics,  such 
as  sample  transition  probability  matrices  of  pixels  or  DOT  coefficients  modeled  as  Markov  chains  [11,  64,  67,  77]. 

The  importance  of  modeling  interactions  among  embedding  changes  in  steganography  has  been  indirectly  recognized 
by  the  designers  of  MPSteg  [10]  (Matching  Pursuit  Steganography)  and  YASS  [73,  80].  In  MPSteg,  the  authors  use 
an  overcomplete  basis  and  embed  messages  by  replacing  small  blocks  with  other  blocks  with  the  hope  of  preserving 
dependencies  among  neighboring  pixels.  The  YASS  algorithm  taught  us  that  a  high  embedding  distortion  may  not 
directly  manifest  as  a  high  statistical  detectability,  a  curious  property  that  can  most  likely  be  attributed  to  the  fact  that 
the  embedding  modifications  are  content  driven  and  mutually  correlated.  Both  approaches  are  heuristic  in  nature  and 
leave  many  important  issues  unanswered,  including  establishing  performance  bounds,  evaluating  the  methods7  performance 
w.r.t.  to  these  bounds,  and  creating  a  methodology  for  achieving  near-optimal  performance. 

The  above  discussion  underlines  the  need  for  a  more  systematic  approach  to  steganography  that  can  consider  mutual 
interaction  of  embedding  modifications,  which  is  the  topic  of  this  section.  The  main  contribution  is  a  general  framework  for 
embedding  using  arbitrary  distortion  functions  and  a  complete  practical  methodology  for  minimizing  embedding  distortion 
in  steganography.  The  approach  is  flexible  as  well  as  modular  and  allows  the  steganographer  to  work  with  non-additive 
distortion  functions.  The  PI  provides  algorithms  for  computing  the  proper  theoretical  bounds  expressing  the  maximal 
payload  embeddable  with  a  bounded  distortion,  for  simulating  the  impact  of  a  stegosystem  operating  on  the  bound,  and 
for  designing  practical  steganographic  algorithms  that  operate  near  the  bound.  The  algorithms  leverage  standard  tools 
used  in  statistical  physics,  such  as  Markov  chain  Monte  Carlo  samplers  or  the  thermodynamic  integration. 
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In  the  next  section,  the  PI  recalls  the  basic  result  that  embedding  changes  made  by  a  steganographic  method  that 
minimizes  embedding  distortion  must  follow  a  particular  form  of  Gibbs  distribution.  The  main  purpose  of  this  section 
is  to  establish  terminology  and  make  connections  between  the  concepts  used  in  steganography  and  those  in  statistical 
physics.  In  Section  6.2,  the  PI  introduces  the  so-called  separation  principle,  which  includes  several  distinct  tasks  that 
must  be  addressed  when  developing  a  practical  steganographic  method.  In  particular,  to  design  and  evaluate  practical 
schemes  one  needs  to  establish  the  relationship  between  the  maximal  payload  embeddable  using  bounded  distortion  (the 
rate-distortion  bound)  and  be  able  to  simulate  the  impact  of  a  scheme  operating  on  the  bound.  In  the  special  case 
when  the  embedding  distortion  can  be  expressed  as  a  sum  of  distortions  at  individual  pixels  computed  from  the  cover 
image  (the  so-called  non-interacting  embedding  changes),  the  design  of  near-optimal  embedding  algorithms  has  been 
successfully  resolved  in  the  past.  For  completeness,  and  because  the  proposed  framework  builds  upon  these  results,  the  PI 
briefly  summarizes  such  known  achievements  in  Section  6.3.  Continuing  with  the  case  of  a  general  distortion  function,  in 
Section  6.4  the  PI  describes  two  useful  tools  for  steganographers  -  the  Gibbs  sampler  and  the  thermodynamic  integration. 
The  Gibbs  sampler  can  be  used  to  simulate  the  impact  of  optimal  embedding  and  to  construct  practical  steganographic 
schemes  (in  Sections  6.5  and  6.6).  The  thermodynamic  integration  is  a  method  for  estimating  the  entropy  and  partition 
function  in  statistical  physics  and  it  is  used  for  computing  the  rate-  distortion  bound  in  steganography.  The  design  of 
practical  embedding  schemes  begins  in  Section  6.5  with  the  study  of  distortion  functions  that  can  be  written  as  a  sum  of 
local  potentials  defined  on  cliques.  In  Section  6.6,  the  PI  first  discusses  various  options  the  new  framework  offers  to  the 
steganography  designer  and  then  makes  a  connection  between  local  potentials  and  image  models  used  in  blind  steganalysis. 
The  proposed  framework  is  experimentally  validated  in  Section  6.7  with  a  discussion  of  various  implementation  issues. 
Finally,  the  section  is  concluded  in  Section  6.8. 

6.1.  Gibbs  distribution  minimizes  embedding  distortion.  The  PI  first  recalls  a  well-known  and  quite  general  fact 
that,  for  a  given  expected  embedding  distortion,  the  maximal  payload  is  embedded  when  the  embedding  changes  follow 
a  Gibbs  distribution.  This  establishes  a  connection  between  steganography  and  statistical  physics,  which,  later  in  this 
section,  will  allow  computing  rate-distortion  bounds,  simulate  the  impact  of  optimal  embedding,  and  construct  practical 
embedding  algorithms. 

Although  the  entire  framework  is  certainly  applicable  to  steganography  in  other  objects  than  digital  images,  it  is 
described  using  the  terms  “image”  and  “pixel”  for  concreteness  to  simplify  the  language  and  to  allow  a  smooth  transition 
from  theory  to  experimental  validation,  which  is  carried  out  for  digital  images.  An  image  x  =  (xi,...,xn)  G  X  =27* 
is  a  regular  lattice  of  elements  (pixels)  X*  G  I,  i  G  <S,  S  =  {1, . . . ,  n}.  The  dynamic  range,  I,  depends  on  the  character 
of  the  image  data.  For  example,  for  an  8-bit  grayscale  image,  2  —  {0, 1, . . . ,  255}.  In  general,  Xi  can  stand  not  only 
for  light  intensity  values  in  a  raster  image  but  also  for  transform  coefficients,  palette  indices,  audio  samples,  etc.  The 
proposed  framework  remains  valid  even  when  Xi  is  organized  into  an  arbitrary  graph  structure.  For  notational  simplicity 
and  convenience,  additional  conventions  are  adopted.  Given  J  C  5,  xj  =  {x,|i  G  J}  and  x^j  =  {x*|t  G  S  -  J7}.  The 
image  (xi , . . . ,  i ,  t/,,  xt+i , . . . ,  xn)  will  be  abbreviated  as 

Every  steganographic  embedding  scheme  considered  here  will  be  associated  with  a  mapping  that  assigns  to  each  cover 
x  G  X  the  pair  {X  7 r}.  Here,  y  C  X  is  the  set  of  all  stego  images  y  into  which  x  is  allowed  to  be  modified  by  embedding 
and  7r  is  a  probability  mass  function  on  y  that  characterizes  the  actions  of  the  sender.  The  embedding  algorithm  is 
such  that,  for  a  given  cover  x,  the  stego  image  y  G  y  is  sent  with  probability  ?r(y).  The  stego  image  is  thus  a  random 
variable  Y  over  y  with  the  distribution  P( Y  =  y)  =  7r(y).  Technically,  the  set  y  and  all  concepts  derived  from  it  in 
this  report  depend  on  x.  However,  because  x  is  simply  a  parameter  that  one  fixes  in  the  very  beginning ,  the  notation  is 
simplified  by  not  making  the  dependence  on  x  explicit.  Finally,  note  that  the  maximal  expected  payload  that  the  sender 
can  communicate  to  the  receiver  in  this  manner  is  the  entropy 

(6.1)  H( n)  4  H( Y)  =  -  £  *(y)  logTr(y). 

To  put  it  another  way,  the  PI  defines  a  steganographic  method  from  the  point  of  view  of  how  it  modifies  the  cover  and 
only  then  deals  with  the  issues  of  how  to  use  it  for  communication  and  how  to  optimize  its  performance.  The  optimization 
will  involve  finding  the  distribution  t r  for  given  x,  y ,  and  payload  (distortion). 

The  following  special  form  of  the  set  y:  y  =  X\  x  I2  x  *  *  *  x  2*  will  be  considered,  where  2*  C  2.  For  example,  in 
Least  Significant  Bit  (LSB)  embedding,  I ,  =  {x^x*},  where  the  bar  denotes  the  operation  of  flipping  the  LSB.  In  LSB 
matching  [49]  (also  called  ±1  embedding)  in  an  8-bit  grayscale  image  x,  2,  =  {x*  —  l,x,,xt  -f  1}  whenever  xt  £  {0,255} 
and  Xi  is  appropriately  modified  for  the  boundary  cases.  When  |2*|  =  2  or  3  for  all  i,  one  speaks  of  binary  and  ternary 
embedding,  respectively.  In  general,  however,  the  size  of  every  set  Xi  can  be  different.  For  example,  pixels  not  allowed  to 
be  modified  during  embedding  (the  so-called  wet  pixels  [36])  have  2 »  =  fx*}. 
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By  sending  a  slightly  modified  version  y  of  the  cover  x,  the  sender  introduces  a  distortion,  which  will  be  measured 
using  a  distortion  function 

(6.2)  D  :  y  ->  R, 

that  is  bounded,  i.e.,  |Z>(y)|  <  K,  for  all  y  6  y  for  some  sufficiently  large  K .  Note  that  D  also  depends  on  x.  Allowing 
the  distortion  to  be  negative  does  not  cause  any  problems  because  an  embedding  algorithm  minimizes  D  if  and  only  if  it 
minimizes  the  non-negative  distortion  D  +  K.  The  need  for  negative  distortion  will  become  apparent  later  in  Section  6.5.1. 
The  expected  embedding  distortion  introduced  by  the  sender  is 

(6.3)  2?W[D]  =  2>(y)Z)(y). 

y€3> 

An  important  premise  made  now  is  that  the  sender  is  able  to  define  the  distortion  function  so  that  it  is  related  to 
statistical  detectability.18  This  assumption  is  motivated  by  a  rather  large  body  of  experimental  evidence,  such  as  [37,  58], 
that  indicates  that  even  simple  distortion  measures  that  merely  count  the  number  of  embedding  changes  correlate  well 
with  statistical  detectability  in  the  form  of  decision  error  of  steganalyzers  trained  on  cover  and  stego  images.  In  general, 
steganographic  methods  that  introduce  smaller  distortion  disturb  the  cover  source  less  than  methods  that  embed  with 
larger  distortion. 

Distortion-limited  sender.  To  maximize  the  security,  the  so-called  distortion-limited  sender  attempts  to  find  a 
distribution  tt  on  y  that  has  the  highest  entropy  and  whose  expected  embedding  distortion  does  not  exceed  a  given  D€ : 

(6.4)  maximize  H(n)  =  -  V'  7r(y)  log7r(y) 

7T  ■'  - 

y€?y 

(6.5)  subject  to  En[D)  =  ^  7r(y)D(y)  =  De. 

y  ey 

By  fixing  the  distortion,  the  sender  fixes  the  security  and  aims  to  communicate  as  large  payload  as  possible  at  this  level  of 
security.  The  maximization  in  (6.4)  is  carried  over  all  distributions  7 r  on  y .  The  PI  comments  on  whether  the  distortion 
constraint  should  be  in  the  form  of  equality  or  inequality  shortly. 

Payload-limited  sender.  Alternatively,  in  practice  it  may  be  more  meaningful  to  consider  the  payload-limited  sender 
who  faces  a  complementary  task  of  embedding  a  given  payload  of  m  bits  with  minimal  possible  distortion.  The  optimization 
problem  is  to  determine  a  distribution  7r  that  communicates  a  required  payload  wffiile  minimizing  the  distortion: 

(6.6)  minimize  En[D]  =  V]  7r(y)D(y) 

yey 

(6.7)  subject  to  H( n)  =  m. 

The  optimal  distribution  7 r  for  both  problems  has  the  Gibbs  form 

(6-8)  7rA(y)  =  — |--exp(-AD(y)), 

where  Z( A)  is  the  normalizing  factor 

(6.9)  Z(A)  =  £exp(-AZ>(y)). 

yey 

The  optimality  of  follows  immediately  from  the  fact  that  for  any  distribution  /i  with  Efx[D\  =  ]Cy fJL(y)^(y)  = 
the  difference  between  their  entropies,  -  H(fi)  =  Dkl{v\\*\)  >  0  [92].  The  scalar  parameter  A  >  0  needs  to  be 

determined  from  the  distortion  constraint  (6.5)  or  from  the  payload  constraint  (6.7),  depending  on  the  type  of  the  sender. 
Provided  m  or  Dt  are  in  the  feasibility  region  of  their  corresponding  constraints,  the  value  of  A  is  unique.  This  follows 
from  the  fact  that  both  the  expected  distortion  and  the  entropy  are  monotone  decreasing  in  A.  To  see  this,  realize  that 
by  direct  evaluation 

(6-10)  ^Enx[D]  = -Varn,[D]  <  0, 

where  VarKx  [JD]  =  Enx  [D2]  —  (£,7rx  [ D ])2.  Substituting  (6.8)  into  (6.1),  the  entropy  of  the  Gibbs  distribution  can  be  written 
as 

18The  ability  of  a  warden  to  distinguish  between  cover  and  stego  images  using  statistical  hypothesis  testing. 
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(6.11)  H(*x)  =  logZ(A)  +  ^A  E„x{D}. 

Upon  differentiating  and  using  (6.10),  one  obtains 

(612)  Tx"M  ~  In2  (W  +E’M  -Al,<lr~|Dl) 

(6.13)  =_^_^ar^PJ<0. 

The  monotonicity  also  means  that  the  equality  distortion  constraint  in  the  optimization  problem  (6.5)  can  be  replaced 
with  inequality,  which  is  perhaps  more  appropriate  given  the  motivating  discussion  above. 

By  varying  A  €  [0,  oo),  one  obtains  a  relationship  between  the  maximal  expected  payload  (6.1)  and  the  expected 
embedding  distortion  (6.3).  For  brevity,  this  relationship  will  be  called  the  rate-distortion  bound.  What  distinguishes  this 
concept  from  a  similar  notion  defined  in  information  theory  is  that  the  bound  is  considered  for  a  given  cover  x  rather  than 
for  X,  which  is  a  random  variable.  At  this  point,  it  is  appropriate  to  note  that  while  it  is  certainly  possible  to  consider  x 
to  be  generated  by  a  cover  source  with  a  known  distribution  and  approach  the  design  of  steganography  from  a  different 
point  of  view,  namely  one  in  which  tt\  is  determined  by  minimizing  the  KL  divergence  between  the  distributions  of  cover 
and  stego  images  while  satisfying  a  payload  constraint,  it  is  not  done  here. 

Finally,  note  that  the  assumption  JZ>(y)|  <  K  implies  that  all  stego  objects  appear  with  nonzero  probability,  7r*(y)  > 
exp(— A K),  a  fact  that  is  crucial  for  the  theory  developed  here. 

Remark  12.  In  statistical  physics,  the  term  distortion  is  known  as  energy.  The  optimality  of  Gibbs  distribution  is 
formulated  as  the  Gibbs  variational  principle:  “ Among  all  distributions  with  a  given  energy,  the  Gibbs  distribution  (6.8) 
has  the  highest  entropy.”  The  parameter  A  is  called  the  inverse  temperature,  A  =  1/fcT,  where  T  is  the  temperature  and 
k  the  Boltzmann  constant.  The  normalizing  factor  Z( A)  is  called  the  partition  function. 

6.2.  The  separation  principle.  The  design  of  steganographic  methods  that  attempt  to  minimize  embedding  distortion 
should  be  driven  by  their  performance.  The  obvious  choice  here  is  to  contrast  the  performance  with  the  rate-distortion 
bound.  This  is  a  meaningful  comparison  for  the  distortion-limited  sender  who  can  assess  the  performance  of  a  practical 
embedding  scheme  by  its  loss  of  payload  w.r.t.  the  maximum  payload  embeddable  using  a  fixed  distortion.  This  so-called 
“coding  loss”  informs  the  sender  of  how  much  payload  is  lost  for  a  fixed  statistical  detectability.  On  the  other  hand,  it  is 
much  harder  for  the  payload-limited  sender  to  assess  how  the  increased  distortion  of  a  suboptimal  practical  scheme  impacts 
statistical  detectability  in  practice.  One  could  resolve  this  rather  important  practical  issue  if  one  was  able  to  simulate  the 
impact  of  a  scheme  that  operates  on  the  bound.19  Because  the  problems  of  establishing  the  bounds,  simulating  optimal 
embedding,  and  creating  a  practical  embedding  algorithm  are  really  three  separate  problems,  this  reasoning  will  be  called 
the  separation  principle.  It  involves  addressing  the  following  three  tasks: 

(1)  Establishing  the  rate-distortion  bounds.  This  means  solving  the  optimization  problems  (6.4)  or  (6.6)  and 
expressing  the  largest  payload  embeddable  using  a  bounded  distortion  (or  minimal  distortion  needed  to  embed  a 
given  payload).  These  bounds  inform  the  steganographer  about  the  best  performance  that  can  be  theoretically 
achieved.  Depending  on  the  form  of  the  distortion  function  D,  establishing  the  bounds  is  usually  rather  challenging 
and  one  may  have  to  resort  to  numerical  methods  (Section  6.4.2).  For  an  additive  distortion  (to  be  precisely  defined 
shortly),  an  analytic  form  of  the  bounds  may  be  obtained  (Section  6.3). 

(2)  Simulating  an  optimal  embedding  method.  Often,  it  is  very  hard  to  construct  a  practical  embedding 
method  that  performs  close  to  the  bound.  However,  one  may  be  able  to  simulate  the  impact  of  such  an  optimal 
method  and  thus  subject  it  to  tests  using  steganalyzers  even  when  it  is  not  known  how  to  construct  a  practical 
embedding  algorithm  or  even  compute  the  bound  (see  Section  6.4).  This  is  important  for  developers  as  one  can 
effectively  “prune”  the  design  process  and  focus  on  implementing  the  most  promising  candidates.  The  simulator 
will  also  inform  the  payload-limited  sender  about  the  potential  improvement  in  statistical  undetectability  should 
the  theoretical  performance  gap  be  closed.  A  simple  example  is  provided  by  the  case  of  the  Hamming  distortion 
function  D( y)  =  YldVi  ^  £<]•  Here,  the  maximal  relative  payload  a  =  m/n  (in  bits  per  pixel  or  bpp)  is  bounded 
by  a  <  h(/3),  where  ft  =  £ Dc  is  the  relative  embedding  distortion  known  as  the  change  rate.  In  this  case,  one 
can  simulate  the  embedding  impact  of  the  optimal  scheme  by  independently  changing  each  pixel  with  probability 
h~1(a). 

19A  scheme  whose  embedding  distortion  and  payload  lay  on  the  rate-distortion  bound  derived  for  a  given  cover. 
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(3)  Constructing  a  practical  near-optimal  embedding  method.  This  point  is  of  most  interest  to  practitioners. 
The  bounds  and  the  simulator  are  necessary  to  evaluate  the  performance  of  any  practical  scheme.  The  designer 
tries  to  maximize  the  embedding  throughput  (the  number  of  bits  embedded  per  unit  time)  while  embedding  as 
close  to  the  distortion  bound  as  possible. 

It  should  be  stressed  at  this  point  that  even  though  the  optimal  distribution  of  embedding  modifications  has  a  known 
analytic  expression  (6.8),  it  may  be  infeasible  to  compute  the  individual  probabilities  7TA(y)  due  to  the  complexity  of 
evaluating  the  partition  function  Z( A),  which  is  a  sum  over  all  y,  whose  count  can  be  a  very  large  number  even  for  small 
images.  (For  example,  there  are  2n  binary  flipping  patterns  in  LSB  embedding.)  This  also  implies  that  at  present  it  is 
not  clear  how  to  compute  the  expected  distortion  (6.3)  or  the  entropy  (6.1)  (these  tasks  are  postponed  to  Section  6.4). 
Fortunately,  in  many  cases  of  practical  interest  it  is  not  necessary  to  evaluate  7r,\(y)  and  it  will  be  enough  to  merely 
sample  from  ir\.  The  ability  to  sample  from  tt\  is  sufficient  to  simulate  optimal  embedding  and  realize  practical  embedding 
algorithms,  and,  in  this  case,  even  compute  the  rate-distortion  bound. 

In  some  special  cases,  however,  such  as  when  the  embedding  changes  do  not  interact,  the  distortion  D  is  additive 
and  one  can  easily  compute  A  and  the  probabilities,  evaluate  the  expected  distortion  and  payload,  and  even  construct 
near-optimal  embedding  schemes.  As  this  special  case  will  be  used  later  in  Section  6.6  to  design  steganography  with  more 
general  distortion  functions  D,  it  is  briefly  reviewed  in  the  next  section. 


6.3.  Non-interacting  embedding  changes.  When  the  distortion  function  D  is  additive  over  the  pixels, 


n 

(6.14)  D(y)  =  J2pi(yi), 

1=1 

with  bounded  Pi  :  li  -»  R,  the  embedding  changes  are  said  to  not  interact.  In  this  case,  the  probability  7r*(y)  can  be 
factorized  into  a  product  of  marginal  probabilities  of  changing  the  individual  pixels  (this  follows  directly  from  (6.8)): 


(6.15) 


^A(y)  =  n*A(y,)=n 


»= i 


i=l 


exp(— A  pj(yQ) 
Et.€ li  exp(-A/>< (*<))' 


The  expected  distortion  and  the  maximal  payload  are: 

n 

(6.16)  E«y[D]  =  £  E 

i=l  (<€1. 
n 

(6.17)  H(*\)  =  “51  H  **(<<)  logTI^). 

«=1  ti€  X, 


The  impact  of  optimal  embedding  can  be  simulated  by  changing  x,  to  yx  with  probabilities  i v\(yi)  independently  of  the 
changes  at  other  pixels.  Since  these  probabilities  can  now  be  easily  evaluated  for  a  fixed  A,  finding  A  that  satisfies  the 
distortion  ( Enx[D ]  =  D€)  or  the  payload  (//(tt*)  =  m)  constraint  amounts  to  solving  an  algebraic  equation  for  A  (see  [31] 
or  [29]).  Because  both  the  expected  distortion  and  the  entropy  are  monotone  w.r.t.  A,  the  solution  is  unique.  The  only 
practical  near-optimal  embedding  algorithm  for  this  case  known  to  the  PI  is  based  on  syndrome- trellis  codes  [25]. 

It  will  be  instructional  to  work  out  as  an  example  the  details  of  the  special  case  of  binary  embedding  for  which 
li  =  {x|°\xt-^}  with  x|0)  =  x,-.  Thus,  px  attains  only  two  values,  p[l)  =  p*(x^),  t  =  0,1.  The  PI  stresses  at  this  point 

that  it  is  not  assumed  that  p|0)  =  0  or  even  that  pj1*  >  p<0).  This  fact  will  be  important  when  implementing  practical 
embedding  schemes  in  Section  6.5.1.  The  above  expressions  simplify  to 


(6.18) 

nx(xW)  =  exp(-A^>) 

otp(-Vil))+exp(-Vi0)) 

(6.19) 

1  +exp(-A(p(0)  -p|J)))  ^ 

(6.20) 

[D]  =  £  pi0)(l  -  P» (A))  +  A), 

1 

(6.21) 

ff(n)«£>(Pi(  A)). 

t=l 
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The  smallest  distortion  any  binary  embedding  algorithm  can  impose  is  £>min  =  nun{p(°\p^},  which  would  be 

incurred  when  selecting  y,-  =  x\l'\  where  U  =  argmin*{p-^}.  Thus, 

(6.22)  D( y)  =  jr  p<0)  [Vi  =  x<0)]  +  p\l)  [Vi  =  st(1)] 

t=i 

n 

(6.23)  =  Dmin  +  ^2  Qi  [y<  ±  a:,-£i)] , 

t=I 

where  Qi  =  [p-1'  —  pf^|  is  now  a  vector  of  non- negative  distortions,  which  allows  applying  the  practical  embedding 
algorithm  described  in  [26].  It  accepts  on  its  input  a  bit  stream  c  =  (ci(x), . . .  ,Cn(x))  (representing  the  cover  x),  the 
vector  of  non-negative  distortions  (pi, . . . ,  gn),  and  a  binary  message.  It  outputs  a  modified  (stego)  bit  stream  y  £  {0,  l}n 
that  conveys  the  message  as  a  syndrome  of  a  suitably  chosen  syndrome-trellis  code  so  that  the  total  embedding  distortion 
EZLi  Q%\y i  7^  Ci]  is  near  minimal.  It  follows  from  (6.23)  that  binary  embedding  as  defined  in  this  section  can  be  implemented 
in  practice  by  applying  this  algorithm  to  the  bit  stream  c*(x),  x  =  . . .  ,Xnn^). 

The  complete  derivation  of  the  rate-distortion  bound  for  binary  embedding  appears,  e.g.,  in  Chapter  7  of  [29]. 


6.4.  Simulated  embedding  and  rate-  distortion  bound.  In  Section  6.1,  it  has  been  showed  that  minimal-embedding- 
distortion  steganography  should  select  the  stego  image  y  with  probability  ^(y)  oc  exp(— AD(y))  expressed  in  the  form 
of  a  Gibbs  distribution.  The  PI  now  explains  a  general  iterative  procedure  using  which  one  can  sample  from  any  Gibbs 
distribution  and  thus  simulate  optimal  embedding.  The  method  is  recognized  as  one  of  the  Markov  Chain  Monte  Carlo 
(MCMC)  algorithms  known  as  the  Gibbs  sampler.20  This  sampling  algorithm  will  allow  constructing  practical  embedding 
schemes  in  Sections  6.5  and  6.6.  It  will  also  be  explained  how  to  compute  the  rate-distortion  bound  for  a  fixed  image 
using  the  thermodynamic  integration.  The  Gibbs  sampler  and  the  thermodynamic  integration  appear,  for  example,  in  [92] 
and  [62],  respectively. 


6.4.1.  The  Gibbs  sampler .  The  PI  starts  with  defining  the  local  characteristics  of  a  Gibbs  field  as  the  conditional  proba¬ 
bilities  of  the  ith  pixel  attaining  the  value  y\  conditioned  on  the  rest  of  the  image: 


(6.24) 


n\(Xi  =  y[\Y~i  =  y_{) 


n\(y'ty~i) 

Et,€  Xi  *A(<<y~i) ' 


For  all  possible  stego  images  y,  y '  €  y,  the  local  characteristics  (6.24)  define  the  following  matrices  P(t),  for  each  pixel 
i€{l . n}: 


(6.25) 


Py,y'(i) 


*\(Yi  =  =  y~j) 

0 


when  y^  =  y~< 
otherwise. 


Every  matrix  P(t)  has  |^|  rows  and  the  same  number  of  columns  (which  means  it  is  very  large)  and  its  elements  are  mostly 
zero  except  when  y1  was  obtained  from  y  by  modifying  t/t  to  y[  and  all  other  pixels  stayed  the  same.  Because  P(i)  is 
stochastic  (the  sum  of  its  rows  is  one), 


(6.26)  ^  /y,y>(i)  =  1*  for  all  rows  y, 

y'ev 

P(i)  is  a  transition  probability  matrix  of  some  Markov  chain  on  y.  All  such  matrices  satisfy  the  so-called  detailed  balance 
equation 

(6.27)  7TA(y).Py,y'(i)  =  7TA(y').Py',y(i),  for  all  y,/  €  i. 


20 


More  detailed  discussion  regarding  our  choice  of  the  MCMC  sampler  appear  later  in  this  section. 
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Algorithm  1  One  sweep  of  a  Gibbs  sampler. 

l:  Set  pixel  counter  i  =  1 
2:  while  i  <  n  do 

3:  Compute  the  local  characteristics: 

(6-34)  ^(i)y^#(i)ty((7(^))’  Va(i)  e -^(0 

4:  Select  one  y'a^  6  X ^(i)  pseudorandomly  according  to  the  probabilities  (G.34)  and  change  4- 

5:  £  <-  t  +  1 

6:  end  while 
7:  return  y 


To  see  this,  realize  that  unless  y ^  =  y^t-,  one  is  looking  at  the  trivial  equality  0  =  0.  For  y^»  =  yl^,  one  obtains  the 
following  chain  of  equalities: 


(6.28) 

n(y]fy.y(<!  = 

(6.29) 

(<>)  7TA(y)*-A(y') 

~  Et.ei,  **(*< y~<) 

(6.30) 

_  /../s  **(y) 

(6.31) 

Equality  (a)  follows  from  the  definition  of  P(i)  (6.25),  (6)  from  the  fact  that  y^i  =  y^,  and  (c)  from  ^(y)  =  ^(y^y^J 
and  again  (6.25). 

Next,  the  PI  defines  the  boldface  symbol  n\  6  [0,  oo)^l  as  the  vector  of  \y\  non-negative  elements  n\  =  7r*(y),  y  £  y. 
Using  (6.27)  and  then  (6.26),  one  can  now  easily  show  that  the  vector  tz\  is  the  left  eigenvector  of  P(t)  corresponding  to 
the  unit  eigenvalue: 

(6.32)  (7rAP(i))y<  =  Y  *&)**#•(*) 

yey 

(6.33)  =  Y  7rA(y,)py'.y(0  “  ^(y)- 

yey 

In  (6.32),  (7TAP(t))y'  is  the  y*th  element  of  the  product  of  the  vector  7T\  and  the  matrix  P(t). 

It  is  now  time  to  describe  the  Gibbs  sampler  [40],  which  is  a  key  element  in  the  framework.  Let  a  be  a  permutation  of 
the  index  set  S  called  the  visiting  schedule  (a(i),  i  =  1, . . . ,  n  is  the  ith  element  of  the  permutation  a).  One  sample  from 
is  then  obtained  by  repeating  a  series  of  “sweeps”  defined  below.  As  the  sweeps  and  the  Gibbs  sampler  are  explained, 
the  reader  is  advised  to  inspect  Algorithm  1  to  better  understand  the  process. 

The  sampler  is  initialized  by  setting  y  to  some  initial  value.  For  faster  convergence,  a  good  choice  is  to  select  yt  from 
Xi  according  to  the  local  characteristics  n\(yiX^i).  A  sweep  is  a  procedure  applied  to  an  image  during  which  all  pixels 
are  updated  sequentially  in  the  order  defined  by  the  visiting  schedule  a.  The  pixels  are  updated  based  on  their  local 
characteristics  (6.24)  computed  from  the  current  values  of  the  stego  image  y.  The  entire  sweep  can  be  described  by  a 
transition  probability  matrix  P(<j)  obtained  by  matrix-multiplications  of  the  individual  transition  probability  matrices 
P(a(i)): 

(6.35)  Py,yl(a)  A  (P(ct(1))  •  P(a(2))  •  •  •  P(a(n)))y>y,  - 

After  each  sweep,  the  next  sweep  continues  with  the  current  image  y  as  its  starting  position.  It  should  be  clear  from 
the  algorithm  that  at  the  end  of  each  sweep  each  pixel  i  has  a  non-zero  probability  to  get  into  any  of  its  states  from 
Xi  defined  by  the  embedding  operation  (because  D  is  bounded).  This  means  that  all  elements  of  y  will  be  visited  with 
positive  probability  and  thus  the  transition  probability  matrix  P(a)  corresponds  to  a  homogeneous  irreducible  Markov 
process  with  a  unique  left  eigenvector  corresponding  to  a  unit  eigenvalue  (unique  stationary  distribution).  Because  7T\  is 
a  left  eigenvector  corresponding  to  a  unit  eigenvalue  for  each  matrix  P(i),  it  is  also  a  left  eigenvector  for  P(<r)  and  thus  its 
stationary  distribution  due  to  its  uniqueness.  A  standard  result  from  the  theory  of  Markov  chains  (see,  e.g.  Chapter  4 
in  [92])  states  that,  for  an  irreducible  Markov  chain,  no  matter  what  distribution  of  embedding  changes  v  £  [0,oo)^  one 
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starts  with,  and  independently  of  the  visiting  schedule  cr,  with  increased  number  of  sweeps,  k ,  the  distribution  of  Gibbs 
samples  converges  in  norm  to  the  stationary  distribution  rcy. 

(6.36)  || v  (P (a))k  —  7T\||  — >  0  with  k  ->  oo 

exponentially  fast.  This  means  that  in  practice  one  can  obtain  a  sample  from  ir\  after  running  the  Gibbs  sampler  for 
a  sufficiently  long  time.21  The  visiting  schedule  can  be  randomized  in  each  sweep  as  long  as  each  pixel  has  a  non-zero 
probability  of  being  visited,  which  is  a  necessary  condition  for  convergence. 


6.4.2.  Simulating  optimal  embedding .  When  applied  to  steganography,  the  Gibbs  sampler  allows  the  sender  to  simulate  the 
effect  of  embedding  using  a  scheme  that  operates  on  the  bound.  It  is  interesting  that  this  can  be  done  for  any  distortion 
function  D  and  without  knowing  the  rate-distortion  bound.  This  is  because  the  local  characteristics  (6.24) 


(6.37) 


*\{Yi=yi\Y„i 


-  _  exp(-A£>(t/'y^)) 

Et.6x4  exp(-AD(tiy^<))  ’ 


do  not  require  computing  the  partition  function  Z( A).  One  does  need  to  know  the  parameter  A,  though. 

For  the  distortion-limited  sender  (6.5),  the  Gibbs  sampler  could  be  used  directly  to  determine  the  proper  value  of  A  in 
the  following  manner.  For  a  given  A,  it  is  known  (Theorem  5.1.4  in  [92])  that 


(6.38)  1  D(yU))  E*x  \D\  as  k  -»  oo 

*  j=i 

in  L*i  and  in  probability,  where  y^’)  is  the  image  obtained  after  the  jrth  sweep  of  the  Gibbs  sampler.  This  requires  running 
the  Gibbs  sampler  and  averaging  the  individual  distortions  for  a  sufficiently  long  time.  When  only  a  finite  number  of 
sweeps  is  allowed,  the  first  few  images  y  should  be  discarded  to  allow  the  Gibbs  sampler  to  converge  close  enough  to  n\. 
The  value  of  A  that  satisfies  E,x  [D]  =  De  can  be  determined,  for  example,  using  a  binary  search  over  A. 

To  find  A  for  the  payload- limited  sender  (6.4),  one  needs  to  evaluate  the  entropy  H(n\),  which  can  be  obtained  from 
[D]  using  the  method  of  thermodynamic  integration  [62].  From  (6.10)  and  (6.13),  one  obtains 

(6M> 

Therefore,  the  entropy  can  be  estimated  from  En x  [ D)  by  integrating  by  parts: 


(6.40) 


H(vx)  =  0)  + 


Ao 


The  value  of  A  that  satisfies  the  entropy  (payload)  constraint  can  be  again  obtained  using  a  binary  search.  Having  obtained 
the  expected  distortion  and  the  entropy  using  the  Gibbs  sampler  and  the  thermodynamic  integration,  the  rate-distortion 
bound  H(nx)1E1Tx [Z)]]  can  be  plotted  as  a  curve  parametrized  by  A. 

In  practice,  one  has  to  be  careful  when  using  (6.38),  since  no  practical  guidelines  exist  for  determining  a  sufficient 
number  of  sweeps  and  heuristic  criteria  are  often  used  [14,  92].  Although  the  convergence  to  tt\  is  exponential  in  the 
number  of  sweeps,  in  general  a  large  number  of  sweeps  may  be  needed  to  converge  close  enough.  Generally  speaking,  the 
stronger  the  dependencies  between  embedding  changes  the  more  sweeps  are  needed  by  the  Gibbs  sampler.  In  theory,  the 
convergence  of  MCMC  methods,  such  as  the  Gibbs  sampler,  may  also  slow  down  in  the  vicinity  of  “phase  transitions,” 
which  is  loosely  defined  here  as  sudden  changes  in  the  spatial  distribution  of  embedding  changes  when  only  slightly 
changing  the  payload  (or  distortion  bound). 

In  experiments  reported  later  in  this  section,  the  Gibbs  sampler  always  behaved  well  and  converged  fast.  This  is 
attributed  to  the  fact  that  the  dependencies  among  embedding  modifications  as  measured  using  our  distortion  functions 
are  rather  weak  and  limited  to  short  distances.  The  convergence,  however,  could  become  an  issue  for  other  types  of  cover 
sources  with  different  distortion  functions.  While  it  is  possible  to  compute  the  rate-distortion  bounds  and  simulate  optimal 
embedding  using  other  MCMC  algorithms,  such  as  the  Metropolis-Hastings  sampler  [92],  that  may  converge  faster  than 
the  Gibbs  sampler  and  can  exhibit  a  more  robust  behavior  in  practice,  it  is  not  clear  how  to  adopt  these  algorithms  for 
practical  embedding.  This  is  because  all  known  coding  methods  in  steganography  essentially  sample  from  a  distribution  of 
independent  symbols.  Thus,  the  Gibbs  sampler  comes  out  as  a  natural  choice  (Section  6.5)  because  it  works  by  updating 
individual  pixels,  which  is  exactly  the  effect  of  embedding  using  syndrome-trellis  codes  [25,  26]. 


21The  convergence  time  may  vary  significantly  depending  on  the  Gibbs  field  at  hand 
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FIGURE  6.1.  The  four-element  cross-neighborhood  and  the  tessellation  of  the  index  set  S  into  two  disjoint 
sublattices  Se  and  <S0. 


I 

—  +— 


FIGURE  6.2.  All  three  possible  cliques  for  the  cross-neighborhood. 


A  notable  alternative  to  the  Gibbs  sampler  and  the  thermodynamic  integration  for  computing  the  rate-distortion  bound 
is  the  Wang  -Landau  algorithm  [87]  that  estimates  the  so-called  density  of  stego  images  (density  of  states  in  statistical 
physics),  g(D),  defined  as  the  number  of  stego  images  y  with  distortion  (energy)  D.  The  partition  function  (and  thus, 
via  (6.11),  the  entropy)  and  the  expected  distortion  can  be  computed  from  g(D)  by  numerical  integration: 


(6.41) 

(6.42) 


Z{\)  =  Y  5(^)exp(-AD)A, 

D€V 


En,[D]  = 


-J-  £  Dg(D)  exp(— AD)A, 


where  V  =  {du  . . . ,  dnD},  d\  =  — K ,  dLnD  =  K,  d{  —  =  A  is  a  set  of  discrete  values  into  which  the  dynamic  range  of 

D,  [— K]  is  quantized. 

The  PI  notes  that  in  general  it  is  not  possible  to  determine  ahead  of  time  which  method  will  provide  satisfactory 
performance.  In  experiments  described  in  Section  6.7,  the  thermodynamic  integration  worked  very  well  and  provided 
results  identical  to  the  much  more  complex  Wang-Landau  algorithm. 

Note  that  computing  the  rate-distortion  bound  is  not  necessary  for  practical  embedding.  In  Section  6.5,  one  introduces 
a  special  form  of  the  distortion  in  terms  of  a  sum  over  local  potentials.  In  this  case,  both  types  of  optimal  senders 
can  be  simulated  using  algorithms  that  do  not  need  to  compute  A  in  the  fashion  described  above.  This  is  explained  in 
Sections  6.5.1  and  6.5.2. 


6.5.  Local  distortion  function.  Thanks  to  the  Gibbs  sampler,  one  can  simulate  the  impact  of  embedding  that  is 
optimal  in  the  sense  of  (6.4)  and  (6.6)  without  having  to  construct  a  specific  steganographic  scheme.  This  is  important 
for  steganography  design  as  one  can  test  the  effect  of  various  design  choices  and  parameters  and  then  implement  only  the 
most  promising  constructs.  However,  it  is  rather  difficult  to  design  near-optimal  schemes  for  a  general  D( y).  Fortunately, 
it  is  possible  to  give  the  distortion  function  a  specific  form  that  will  allow  constructing  practical  embedding  algorithms. 
It  will  be  assumed  that  D  is  a  sum  of  local  potentials  defined  on  small  groups  of  pixels  called  cliques.  This  local  form  of 
the  distortion  will  be  still  quite  general  to  capture  dependencies  among  embedding  changes  and  it  allows  construction  of 
a  large  spectrum  of  diverse  embedding  schemes  -  a  topic  left  for  Section  6.6. 

First,  the  PI  defines  a  neighborhood  system  as  a  collection  of  subsets  of  the  index  set  {17(1)  c  S\i  =  1, . . . ,  n}  satisfying 
i  77(2),  Vi  and  i  €  7]{j)  if  and  only  if  j  €  77(2).  The  elements  of  rj(i)  are  called  neighbors  of  pixel  i .  A  subset  c  C  S  is  a 
clique  if  each  pair  of  different  elements  from  c  are  neighbors.  The  set  of  all  cliques  will  be  denoted  C.  The  PI  does  not  use 
the  calligraphic  font  for  a  clique  even  though  it  is  a  set  (and  thus  deviate  here  from  the  established  convention)  to  comply 
with  a  well  established  notation  used  in  previous  art. 

In  this  section  and  in  Section  6.6,  it  will  be  necessary  to  address  pixels  by  their  two-dimensional  coordinates.  The  PI 
will  hus  be  switching  between  using  the  index  set  S  =  {1, . . .  ,n}  and  its  two-dimensional  equivalent  S  =  {(i,i)|l  <  i  < 
ni,  1  <  j  <n 2}  hoping  that  it  will  cause  no  confusion  for  the  reader. 

Example  13.  The  four-element  cross  neighborhood  of  pixel  Xij  consisting  of  with  a  proper 

treatment  at  the  boundary  forms  a  neighborhood  system  (see  Figure  6.1).  The  cliques  contain  either  a  single  pixel  (one- 
element)  cliques  {£t,j}  or  two  horizontally  or  vertically  neighboring  pixels,  (Figure  6.2).  No 

other  cliques  exist. 
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Figure  6.3.  The  eight-element  neighborhood  and  the  tessellation  of  the  index  set  S  into  four  disjoint 

sublattices  marked  with  four  different  symbols. 
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Figure  6.4.  All  possible  cliques  for  the  eight-element  neighborhood. 


Example  14.  The  eight-element  3x3  neighborhood  also  forms  a  neighborhood  system  (Figure  6.3).  The  cliques  are 
as  in  Example  13  as  well  as  all  cliques  containing  pairs  of  diagonally  neighboring  pixels,  {x^Xt+ij+i},  {xij,Xi_i  j+i}, 
three-pixel  cliques  forming  a  right-angle  triangle  (e.g.,  {xij,Xij+i,x*+ij}),  and  four-pixel  cliques  forming  a  2  x  2  square 
({xij,x*j+i,Xi+ij,x*+ij+i}^  (follow  Figure  6.4).  No  other  cliques  exist  for  this  neighborhood  system. 

Each  neighborhood  system  allows  tessellation  of  the  index  set  S  into  disjoint  subsets  (sublattices)  whose  union  is  the 
entire  set  S}  so  that  any  two  pixels  in  each  lattice  are  not  neighbors.  For  example,  for  the  cross-neighborhood  S  =  U«S0, 

where 


(6.43)  <SC  =  {(i,j)\i  +  j  is  even},  SQ  =  {(i,j)\i  +  j  is  odd}. 

For  the  eight-element  3x3  neighborhood,  there  are  four  sublattices,  S  =  \JabSab,  1  <  a,  b  <  2,  whose  structure 
resembles  the  Bayer  color  filter  array  commonly  used  in  digital  cameras  [29], 

(6.44)  Sab  =  {(a  +  2fc, 6  4-  2i)|l  <  a  +  2k  <  nx,  1  <  6  +  21  <  n2}. 

For  a  clique  c  €  C,  one  denotes  by  Kc(y)  the  local  potential,  which  is  an  arbitrary  bounded  function  that  depends  only 
on  the  values  of  y  in  the  clique  c,  V"c(y)  =  Vc(yc).  The  PI  reminds  that  Vc  may  also  depend  on  x  in  an  arbitrary  fashion. 
It  is  time  now  to  introduce  a  local  form  of  the  distortion  function  as 

(6-45)  D(  y)  =  £vc(yc). 

c€C 


The  important  fact  is  that  D  is  a  sum  of  functions  with  a  small  support.  Let  us  express  the  local  characteristics  (6.24)  in 
terms  of  this  newly-defined  form  (6,45): 


(6.46) 

(6.47) 


tta  (Yi  =  y'i\y~i)  = 


exp(-A  EcfcC  ^(yfr^O) 

exp(~"^  Sc€C  Vc(Uy~i)) 


(a)  exp(-A  Ec€c(i)  Vc{Vjy~i )) 

exP(~^  5Zcec(i)  vc(tiy~i)) 


where  C(i)  =  {c  £  C\i  Gc},t  =  l,...,n.  Equality  (a)  holds  because  Vc(tiy„i)  does  not  depend  on  U  for  cliques  c  ^  C(i) 
as  they  do  not  contain  the  ith  element.  Thus,  the  terms  Vc  for  such  cliques  cancel  from  (6.47).  This  has  a  profound 
impact  on  the  local  characteristics,  making  the  realization  of  Yi  independent  of  changes  made  outside  of  the  union  of 
cliques  containing  pixel  z  and  thus  outside  of  the  neighborhood  rj(i).  For  the  cross-neighborhood  system  from  Example  13, 
changes  made  to  pixels  belonging  to  the  sublattice  Se  do  not  interact  and  thus  the  Gibbs  sampler  can  be  parallelized  by 
first  updating  all  pixels  from  this  sublattice  in  parallel  and  then  updating  in  parallel  all  pixels  from  S<,.22 


^2The  Gibbs  random  field  described  by  the  joint  distribution  ^^(y)  with  distortion  (6.45)  becomes  a  Markov  random  field  on  the  same 
neighborhood  system.  This  follows  from  the  Hammersley-Clifibrd  theorem  [92], 
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Algorithm  2  One  sweep  of  a  Gibbs  sampler  for  embedding  m-bit  message  (payload-limited  sender). 

Require:  S  =  Si  U  . . .  U  S8  {mutually  disjoint  sublattices} 
l:  for  k  =  1  to  s  do 
2:  for  every  i  £  Sk  do 

3:  Use  (6.48)  to  calculate  cost  of  changing  y,  y'  6  Z» 

4:  end  for 

5:  Embed  m/s  bits  while  minimizing  ^2ieSk 

6:  Update  ysk  with  new  values  and  keep  y~sk  unchanged. 

7:  end  for 
8:  return  y 


The  possibility  to  update  all  pixels  in  each  sublattice  all  at  once  provides  a  recipe  for  constructing  practical  embedding 
schemes.  Assume  S  =  <SiU. .  .U Ss  with  mutually  disjoint  sublattices.  The  PI  first  describes  the  actions  of  a  payload-limited 
sender  (follow  the  pseudo-code  in  Algorithm  2). 

6.5.1.  Payload-limited  sender.  The  sender  divides  the  payload  of  m  bits  into  s  equal  parts  of  m/s  bit s,  computes  the  local 
distortions 

(6.48)  =  53  t/e(y<y~i) 

cec(i) 

for  pixels  t  G  <Si,  and  embeds  the  first  message  part  in  Si.  Then,  it  updates  the  local  distortions  of  all  pixels  from  S2 
and  embeds  the  second  part  in  £2,  updates  the  local  distortions  again,  embeds  the  next  part  in  <S3,  etc.  Because  the 
embedding  changes  in  each  sublattice  do  not  interact,  the  embedding  can  be  realized  as  discussed  in  Section  6.3.  After  all 
sublattices  are  processed  it  will  mean  that  one  embedding  sweep  was  completed.  By  repeating  these  embedding  sweeps,23 
the  resulting  modified  images  will  converge  to  a  sample  from  7 r*. 

The  embedding  in  sublattice  Sk  will  introduce  embedding  changes  with  probabilities  (6.15),  where  the  value  of  A*  is 
determined  by  the  individual  distortions  {pi(y *y~i)|*  €  Sk}  (6.48)  to  satisfy  the  payload  constraint  of  embedding  m/s  bits 
in  the  kth  sublattice  (again,  e.g.,  using  a  binary  search  for  A*).  Because  each  sublattice  extends  over  a  different  portion  of 
the  cover  image  while  splitting  the  payload  evenly  across  the  sublattices,  A k  may  slightly  vary  with  k  because  of  variations 
in  the  individual  distortions.  This  represents  a  deviation  from  the  Gibbs  sampler.  Fortunately,  the  sublattices  can  often 
be  chosen  so  that  the  image  does  not  differ  too  much  on  every  sublattice,  which  will  guarantee  that  the  sets  of  individual 
distortions  {pi(y,/y~i)|i  €  S*}  are  also  similar  across  the  sublattices.  Thus,  with  an  increased  number  of  sweeps,  A*  will 
converge  to  an  approximately  common  value  and  the  whole  process  represents  a  correct  version  of  the  Gibbs  sampler. 

In  binary  embedding  (I £  =  {arf^x-1^}),  note  that  the  two  distortions  p-0^(xf^y~i)  =  D(x[°  yv(i)),  = 

yv(i))  at  pixel  i  depend  on  the  current  pixel  values  in  its  neighborhood  17(1).  Therefore,  both  and  can 

be  non-zero  at  the  same  time  and  one  can  even  have  <  p|0).  It  is  the  neighborhood  of  i  that  ultimately  determines 
whether  or  not  it  is  beneficial  to  preserve  the  value  of  the  pixel! 


6,5.2.  Distortion-limited  sender.  A  similar  approach  can  be  used  to  implement  the  distortion-limited  sender  with  a  dis¬ 
tortion  limit  Z>€.  Consider  a  simulation  of  such  embedding  by  a  Gibbs  sampler  with  the  correct  A  (obtained  from  a  binary 
search  as  described  in  Section  6.4.2)  on  the  sublattice  Sk  C  S.  Assuming  again  that  all  sublattices  have  the  same  distortion 
properties,  the  distortion  obtained  from  cliques  containing  pixels  from  Sk  should  be  proportional  to  the  number  of  such 
cliques.  Formally, 


(6.49) 


E*x(Ysk\Y~sk)[D]  =  Dt 


[{eg CicnSfc  ^ 0}] 

|C| 


As  described  in  Algorithm  3,  the  sender  can  realize  this  by  embedding  as  many  bits  to  every  sublattice  as  possible  while 
achieving  the  distortion  (6.49).  Note  that  one  does  not  need  to  compute  the  partition  function  for  every  image  in  order  to 
realize  the  embedding.  Moreover,  in  practice  when  the  embedding  is  implemented  using  syndrome- trellis  codes  [26],  the 
search  for  the  correct  parameter  A,  as  described  in  Section  6.4.2,  is  not  needed  either  as  long  as  the  distortion  properties 
of  every  sublattice  are  the  same.  This  is  because  the  codes  need  the  local  distortion  p,(y'y~»)  (6.48)  at  each  lattice  pixel 
i  and  not  the  embedding  probabilities.  (This  eliminates  the  need  for  A.) 


23After  each  embedding  sweep,  at  each  pixel  the  previous  change  is  erased  and  the  pixel  is  reconsidered  again,  just  like  in  the  Gibbs  sampler. 
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Algorithm  3  One  sweep  of  a  Gibbs  sampler  for  a  distortion-limit  sender,  Enx  [D]  =  Dt. 

Require:  S  =  S\  U  . . .  U  Ss  {mutually  disjoint  sublattices} 

1:  for  k  =  1  to  s  do 
2:  for  every  i  €  Sk  do 

3:  Use  (6.48)  to  calculate  Cost  of  changing  yx  ->  y[  £ 

4:  end  for 

5:  Embed  mk  bits  while  Si  Pi{ViY~i)  =  Dcx\{c£  C|cnS*  ^  0}|/|C|. 

6:  Update  ysk  with  new  values  and  keep  y~sk  unchanged. 

7:  end  for 

8:  return  y  and  J2k  {stego  image  and  number  of  bits} 


The  issue  of  the  minimal  sufficient  number  of  embedding  sweeps  for  both  algorithms  needs  to  be  studied  specifically  for 
each  distortion  measure  (see  the  discussion  in  the  experimental  Section  6.7).  By  replacing  a  specific  practical  embedding 
method  with  a  simulator  of  optimal  embedding,  one  can  simulate  the  impact  of  optimal  algorithms  (for  both  senders) 
without  having  to  determine  the  value  of  the  parameter  A  as  described  in  Section  6.4.2.  It  is  still  necessary  to  compute 
A*  for  each  sublattice  Sk  to  obtain  the  probabilities  of  modifying  each  pixel  (6.15),  but  this  can  be  done  as  described  in 
Section  6.3  without  having  to  use  the  Gibbs  sampler  or  the  thermodynamic  integration. 

Finally,  the  PI  comments  on  how  to  handle  wet  pixels  within  this  framework.  Since  it  is  assumed  that  the  distortion 
is  bounded  (|D(y)|  <  K  for  all  y  €  3^),  wet  pixels  are  handled  by  forcing  Xi  =  {x<}.  Because  this  knowledge  may  not  be 
available  to  the  decoder  in  practice,  practical  coding  schemes  should  treat  them  either  by  setting  pi(yi)  =  oo  or  to  some 
large  constant  for  xjx  ^  xx  (for  details,  see  [26]). 

6.5.3.  Practical  limits  of  the  Gibbs  sampler .  Thanks  to  the  bounds  established  in  Section  6.1,  it  is  now  known  that  the 
maximal  payload  that  can  be  embedded  in  this  manner  is  the  entropy  of  rc\  (6.11).  Assuming  the  embedding  proceeds  on 
the  bound  for  the  individual  sublattices,  the  question  is  how  close  the  total  payload  embedded  in  the  image  is  to  H(7r\). 
Following  the  Gibbs  sampler,  the  configuration  of  the  stego  image  will  converge  to  a  sample  y  from  tt\.  Let  us  now  go 
through  one  more  sweep  with  y  *1  denoting  the  stego  image  before  starting  embedding  in  sublattice  Sk,  k  =s  1, . . . ,  s.  In 
each  sublattice,  the  following  payload  is  embedded: 

(6-50)  H(YSk\Y„sk  =y~]sj- 

The  following  result  from  information  theory  is  now  used:  For  any  random  variables  X\, . . . ,  XSl 

$ 

(6.51)  <#(*!,- ••,**), 

/c=l 

with  equality  only  when  all  variables  are  independent.24  Thus,  in  general 

(6.52)  H~( Y)  £  £  H(YSk \Y~Sk  =  y^J  <  ^ <Y)  = 

k= 1 

The  term  H~( Y)  is  recognized  as  the  erasure  entropy  [84,  85]  and  it  is  equal  to  the  conditional  entropy  H( Y^z+1  YJ>) 
(entropy  rate)  of  the  Markov  process  defined  by  our  Gibbs  sampler  (c.f.,  (6.35)),  where  Y ^  is  the  random  variable  obtained 
after  l  sweeps  of  the  Gibbs  sampler. 

The  erasure-entropy  inequality  (6.52)  means  that  the  embedding  scheme  will  be  suboptimal,  unable  to  embed  the 
maximal  payload  The  actual  loss  can  be  assessed  by  evaluating  the  entropy  of  e.g.,  using  the  algorithms 

described  in  Section  6.4.  An  example  of  such  comparison  is  presented  in  Section  6.7.3. 

The  last  remaining  issue  is  the  choice  of  the  potentials  Vc.  In  the  next  section,  the  PI  shows  one  example  where  Vc  are 
chosen  to  tie  the  principle  of  minimal  embedding  distortion  to  the  preservation  of  the  cover-source  model.  The  PI  also 
describes  a  specific  embedding  method  and  subjects  it  to  experiments  using  blind  steganalyzers. 


24For  k  —  2,  this  result  follows  immediately  from  H(X  1IX2)  +  H(X 2\X\)  =  H(X  1^X2)  —  /(X 1;  X2).  The  result  for  s  >  2  can  be  obtained 
by  induction  over  $. 
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6.6.  Practical  embedding  constructions.  Here,  a  practical  embedding  method  that  uses  the  theory  developed  so  far 
is  described.  First  and  foremost,  the  potentials  Vc  should  measure  the  detectability  of  embedding  changes.  There  is 
substantial  freedom  in  choosing  them  and  the  design  may  utilize  reasoning  based  on  theoretical  cover  source  models  as 
well  as  heuristics  stemming  from  experiments  using  blind  stcganalyzers.  The  proper  design  of  potentials  is  a  complicated 
subject  in  itself  and  is  beyond  the  scope  of  this  effort,  whose  main  purpose  was  to  introduce  a  general  framework  rather 
than  optimizing  the  design.  Here,  the  PI  describes  a  specific  example  of  a  more  general  approach  that  builds  upon  the 
latest  results  in  steganography  and  steganalysis  and  one  that  provided  an  opportunity  to  validate  the  proposed  framework 
by  showing  an  improvement  over  the  current  state  of  the  art  in  Section  6.7. 

6.6.1.  Additive  approximation .  As  argued  in  the  introduction,  the  steganography  design  principles  based  on  model  preser¬ 
vation  and  on  minimizing  distortion  coincide  when  the  distortion  is  defined  as  a  norm  of  the  difference  of  feature  vectors 
used  to  model  cover  images: 

(6.53)  D{y)  =  ||/(x)  -  /(y)||  =  £ ™fc!/*(x)  -  /*(y)J. 

k=l 

Here,  /(x)  =  (/i(x),...,/<f(x))  €  Rd  is  a  d-dimensional  feature  vector  of  image  x  and  w  =  are  weights. 

The  properties  of  D  defined  in  this  manner  depend  on  the  properties  of  the  functions  fk •  In  general,  however,  D  is  not 
additive.  In  the  past,  steganographers  were  forced  to  use  some  additive  approximation  of  D  to  realize  the  embedding  in 
practice.  A  general  method  for  turning  an  arbitrary  distortion  measure  into  an  additive  proceeds  is: 

n 

(6.54)  D(y)  =  £  D(Vi*~i)> 

1=1 

Embedding  with  the  additive  measure  D  can  be  simulated  (and  realized)  as  explained  in  Section  6.3.  The  approximation, 
of  course,  ensues  a  capacity  loss  due  to  a  mismatch  in  the  minimized  distortion  function.  Thanks  to  the  methods  introduced 
in  Section  6.4.2,  this  loss  can  now  be  contrasted  against  the  rate-distortion  bound  for  the  original  measure  D.  However, 
one  cannot  build  a  practical  scheme  unless  D  can  be  written  as  a  sum  of  local  potentials.  Next,  the  PI  explains  how  to 
turn  D  into  this  form  using  the  idea  of  a  bounding  distortion. 

6.6.2.  Bounding  distortion.  Most  features  used  in  steganalysis  can  be  written  as  a  sum  of  locally-supported  functions 
across  the  image 

(6-55)  A(x)  =  £/<fc>(x), 

e€C 

For  example,  the  kth  histogram  bin  of  image  x  can  be  written  using  the  Iverson  bracket  as 

(6.56)  ftfc(x)  =  ]£[x,-  =  fc], 

<65 

while  the  kith  element  of  a  horizontal  co-occurrence  matrix 

uj  na  — 1 

(6.57)  Cfc.i(x)  =  £  £ 1 =  *)[*<,;+!  = 

<=i  l 

is  a  sum  over  horizontally  adjacent  pixels  (horizontal  two-pixel  cliques).  For  such  locally-supported  features,  one  can 
obtain  an  upper  bound  on  D(y)  =*  ||/(x)  —  /(y)||,  y 

(6-58)  ll/(x)-/(y)||  = 

(6.59)  < 

(6.60) 

(6.61) 


€  y,  that  has  the  required  form: 

YWk  Hfck)(x)  - 

k=  1  cGC 

5>*£l/ifc)(x)-/<fc)(y)| 

k- 1  c€C 

££^|/C(fc)(x)-/W(y)| 
c€C  k=l 

Y  ^(y). 

e€C 
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where 


(6.62) 


Vc(y)  =  'tw«\fck)M-f(ck\y)\- 


k=l 


Following  our  convention  explained  in  Section  6.1,  the  PI  describes  the  methodology  for  a  fixed  cover  image  x  and  thus 
does  not  make  the  dependence  of  Vc  on  x  explicit.  The  sum  K(y)  will  be  called  the  bounding  distortion. 

The  PI  now  provides  a  specific  example  of  this  approach.  The  choice  is  motivated  by  the  desire  to  work  with  a  modern, 
well-established  feature  set  so  that  later,  in  Section  6.7,  one  can  validate  the  usefulness  of  the  proposed  framework 
by  constructing  a  high-capacity  steganographic  method  undetectable  using  current  state-of-the-art  steganalyzer.  The 
motivation  and  justification  of  the  feature  set  appears  in  [66].  It  is  a  slight  modification  of  the  SPAM  set  [64],  which  is  the 
basis  of  the  current  most  reliable  blind  steganalyzer  in  the  spatial  domain.  The  features  are  constructed  by  considering 
the  differences  between  neighboring  pixels  (e.g.,  horizontally  adjacent  pixels)  as  a  higher-order  Markov  chain  and  taking 
the  sample  joint  probability  matrix  (co-occurrence  matrix)  as  the  feature.  The  advantage  of  using  the  joint  matrix  instead 
of  the  transition  probability  matrix  is  that  the  norm  of  the  feature  difference  can  be  readily  upper-bounded  by  the  desired 
local  form  (6.62). 

To  formally  define  the  feature  for  an  ni  x  U2  image  x,  let  us  consider  the  following  co-occurrence  matrix  computed 
from  horizontal  pixel  differences  (x)  =  xtJ+i  -  xiyj,  i  =  1, . . . ,  ni,  j  =  1, . . . ,  ri2  —  1: 


(6.63) 


AkAx) 


l 

ni(n2-2) 


EEP.j’^1  )«  =  (M)]. 


t=i  j= i 


For  compactness,  in  (6.63)  the  argument  of  the  Iverson  bracket  is  abbreviated  from  D~j(x)  =  k  &  =  l  to 

(D”^,Z)”^+1)(x)  =  ( k,l ).  Clearly,  A^-(x)  is  the  normalized  count  of  neighboring  triples  of  pixels  2}  with 

differences  Xij+i  —  Xij  =  k  and  x*j+ 2  —  =  l  in  the  entire  image.  The  superscript  arrow  denotes  the  fact  that 

the  differences  are  computed  by  subtracting  the  left  pixel  from  the  right  one.  Similarly, 


(6.64) 


AtAx) 


1 

ni(n2  -  2) 


=  (m>] 


t=l  j= 3 


with  £)£~(x)  =  x»j_  1  —  Xij.  By  analogy,  one  can  define  vertical,  diagonal,  and  minor  diagonal  matrices  Ak  z,  A^  z, 
Akl)  A A^|.  All  eight  matrices  are  sample  joint  probabilities  of  observing  the  differences  k  and  /  between  three 
consecutive  pixels  along  a  certain  direction.  Due  to  the  antisymmetry  jD“^(x)  =  -  jD^+1(x)  only  Aj^,  A'k  ^  a£  A^  are 
needed  since  Aj^z  =  A^t  _k,  and  similarly  for  other  matrices. 

Because  neighboring  pixels  in  natural  images  are  strongly  dependent,  each  matrix  exhibits  a  sharp  peak  around  (&,  l)  = 
(0,0)  and  then  quickly  falls  off  with  increasing  k  and  l.  When  such  matrices  are  used  for  steganalysis  [64],  they  are 
truncated  to  a  small  range,  such  as  — T  <  fc,  l  <  T,  T  =  4,  to  prevent  the  onset  of  the  “curse  of  dimensionality.”  On  the 
other  hand,  in  steganographv  one  can  use  large-dimensional  models  (T  ==  255)  because  it  is  easier  to  preserve  a  model 
than  to  learn  it.25  Another  reason  for  using  a  high- dimensional  feature  space  is  to  avoid  “overtraining”  the  embedding 
algorithm  to  a  low-dimensional  model  as  such  algorithms  may  become  detectable  by  a  slightly  modified  feature  set,  an 
effect  already  reported  in  the  DCT  domain  [56]. 


^Similar  reasoning  for  constructing  the  distortion  function  was  used  in  the  HUGO  algorithm  [66]. 
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Figure  6.5.  The  union  of  all  12  cliques  consisting  of  three  pixels  arranged  in  a  straight  line  in  the  5x5 
square  neighborhood. 

By  embedding  a  message,  >lj^(x)  is  modified  to  A^iy).  The  differences  between  the  features  will  thus  serve  as  a 
measure  of  embedding  impact  closely  tied  to  the  model  (the  indices  i  and  j  run  from  1  to  n i  and  ri2  —  2,  respectively): 


=  (M)] 


(6.65) 

I4u(y)  -  am(x)I  = 

(6.66) 

=ni(„:  _2)|i:i(c«.d3« 

(6.67) 

-[(D-,D^+i)(x)  =  (fc,/)] 

(6.6S) 

-  n1(n2-2)?ll(D<J',D‘*J+1 

(6.69) 

-  ((Dy,Dij+I(x)  =  (M) ]| 

(6.70) 

=  £  H(ck^(y), 

cec-* 

where  the  PI  defined  the  following  locally-supported  functions 

0-Vy)  =  - - - 

c  (Y)  «i(n2  —  2) 

(6-71)  •  |[(D-.,D-+1)(y)  =  (k,l))  -  [(D-.,D-.+1)(x)  =  (M)]| 

on  all  horizontal  cliques  =  {c|c  =  {(t,  j),  (i,j  4  1),  (i,  j  4*  2)}}.  Notice  that  the  absolute  value  had  to  be  pulled  into 
the  sum  to  give  the  potentials  a  small  support. 

Since  the  other  three  matrices  can  be  written  in  this  manner  as  well,  one  can  write  the  distortion  function  in  the 
following  final  form 

(6.72)  D(  y)  =  2>e(y), 

c€C 

now  with  C  —  C~*  U  C S  U  CT  U  C\  the  set  of  three-pixel  cliques  along  all  four  directions,  and 

(6.73)  \4(y)  =  ^  Wk,iH(k'l>}-*( y),  for  each  clique  c  €  C“\ 

k,i 

and  similarly  for  the  other  three  clique  types.  Notice  that  there  are  again  weights  w/cj  >  0  in  the  definition  of  Vc  that  can 
be  adjusted  according  to  how  sensitive  steganalysis  is  to  the  individual  differences.  For  example,  if  a  certain  difference 
pair  (fc,Z)  varies  significantly  over  cover  images,  by  assigning  it  a  smaller  weight  one  can  allow  it  to  be  modified  more 
often,  while  those  differences  that  are  stable  across  covers  but  sensitive  to  embedding  should  be  intuitively  assigned  a 
larger  value  so  that  the  embedding  does  not  modify  them  too  much. 

To  complete  the  picture,  the  neighborhood  system  here  is  formed  by  5  x  5  neighborhoods  and  thus  the  index  set  can 
be  decomposed  into  nine  disjoint  sublattices  S  =  Uab<Sab ,  1  <  a,  5  <  3, 

(6.74)  Sab  ~  {(u  4  3/c,  6  4  3/),l  ^  cl  4  3  k  ^  tii ,  1  <64  3Z  <  712}* 

To  better  explain  the  effect  of  embedding  changes  on  the  distortion,  realize  that  each  pixel  belongs  to  three  horizontal, 
three  vertical,  three  diagonal,  and  three  minor-diagonal  cliques.  When  a  single  pixel  Xi.j  is  changed,  it  affects  only  the  12 
potentials  whose  clique  contains  Xifj.  For  example  if  the  original  pixel  values  Co  =  {£ij»£*tj+i,£ij+2}  had  differences  fc,Z, 
and  the  pixel  value  changed  from  Xij  to  y^j  =  Xij  4 1.  Then,  the  pixel  differences  will  be  modified  to  k  —  1, 1.  Considering 


FINAL  REPORT  FOR  CONTRACT  FA9550-08- 1-0084  ENTITLED  “TOWARDS  STATISTICALLY  UNDETECTABLE  STEGANOGRAPHY"  62 


- Bounding  dist 

- Additive  approx. 

.  Binary  ±1 

- Ternary  ±1 

°0  0.1  0.2  0.3  0.4  0.5 


Relative  payload  a  (bpp) 


Figure  6.6.  Comparison  of  ±1  embedding  with  optimal  binary  and  ternary  coding  with  binary  embed¬ 
ding  algorithms  based  on  the  Gibbs  construction  with  a  bounding  distortion  and  the  additive  approxima¬ 
tion  as  described  in  Section  6.7.1.  The  error  bars  depict  the  minimum  and  maximum  steganalyzer  error 
Pe  over  five  runs  of  SVM  classifiers  with  different  division  of  images  into  training  and  testing  set. 


just  the  contribution  from  to  the  potential  (6.73),  it  will  increase  by  the  sum  of  Wk,i  (the  pair  kj  is  leaving 

cover)  and  Wk-u  (a  new  pair  appears  in  the  stego  image). 

6.6.3.  Other  options.  The  framework  presented  in  this  report  allows  the  sender  to  formulate  the  local  potentials  directly 
instead  of  obtaining  them  as  the  bounding  distortion.  For  example,  the  cliques  and  their  potentials  may  be  determined 
by  the  local  image  content  or  by  learning  the  cover  source  using  the  method  of  fields  of  experts  [70].  The  merit  of  these 
possibilities  can  be  evaluated  by  steganalyzers  trained  on  a  large  set  of  images.  The  important  question  of  optimizing  the 
local  potential  functions  w.r.t.  statistical  detectability  is  an  important  direction  the  PI  intends  to  explore  in  the  future. 

6.7.  Experiments.  In  this  section,  the  PI  validates  the  proposed  framework  experimentally  and  includes  a  comparison 
between  simple  steganographic  algorithms,  such  as  binary  and  ternary  ±1  embedding  and  steganography  implemented 
via  the  bounding  distortion  and  the  additive  approximation  (6.54).  For  the  case  of  the  bounding  distortion,  the  capacity 
loss  w.r.t.  the  optimal  payload  given  by  H( 7T\)  is  evaluated  by  means  of  the  thermodynamic  integration  algorithm  from 
Section  6.4.2. 

6.7.1.  Tested  embedding  methods.  For  the  methods  based  on  additive  approximation  and  the  bounding  distortion,  the  PI 
used  as  a  feature  vector  the  joint  probability  matrix  m(x)  defined  similarly  as  in  (6.63)  with  the  difference  vector 
computed  from  four  consecutive  pixels  D^+2)  =  (k,l,m).  As  above,  four  such  matrices  corresponding  to 

four  spatial  directions  were  computed.  The  matrices  were  used  at  their  full  size  T  =  255  leading  to  model  dimensionality 
d  =  4x  5113  «  5  •  108. 

The  weights  were  chosen  to  be  small  for  those  triples  {Djj,  =  tn)  that  occur  infrequently  in  images 

and  large  for  frequented  triples.  Following  the  recommendation  described  in  [66],  since  the  frequency  of  occurrence  of  the 
triples  falls  off  quickly  with  their  norm,  the  weights  are  chosen  as 

(6.75)  Wk,i,m  =  (cr  4*  \A2  -f  l2  +  m2j  , 

with  0  =  1  and  cr  =  1.  The  purpose  of  the  weights  is  to  force  the  embedding  algorithm  to  modify  those  parts  of  the  model 
that  are  difficult  to  model  accurately,  forcing  thus  the  steganalyst  to  use  a  more  accurate  model.  Here,  the  advantage  goes 
to  the  steganographer,  because  preserving  a  high-dimensional  feature  vector  is  more  feasible  than  accurately  modeling  it. 
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Because  the  neighborhood  Tj(i)  in  this  case  contains  7x7  pixels,  the  image  was  divided  into  16  square  sublattices  on 
which  embedding  was  carried  out  independently.  The  PI  tested  binary  embedding,  2*  =  {xt,x'},  where  x\  was  selected 
randomly  and  uniformly  from  {x*  —  1,  X{  + 1}  and  then  fixed  for  all  experiments  with  cover  x.  The  payload-limited  sender 
was  simulated  using  the  Gibbs  sampler  constrained  to  only  two  sw'eeps.  Increasing  the  number  of  sweeps  did  not  lead  to 
further  improvement.  The  curiously  low'  number  of  sweeps  sufficient  to  properly  implement  the  Gibbs  sampler  is  most 
likely  due  to  the  fact  that  the  dependencies  dictated  by  the  bounding  distortion  are  rather  weak.  The  simulation  of 
embedding  for  one  image  took  less  than  5  seconds  when  implemented  in  C++. on  a  single-processor  PC. 

To  summarize,  the  foliowring  four  steganographic  methods  were  tested: 

(1)  Binary  embedding  using  the  Gibbs  construction  with  sets  2*  =  {x*,xj}  and  bounding  distortion  (6.72)  of  (6.53) 
with  weights  (6.75)  for  the  d  =  4  x  5113-dimensional  feature  space  given  by  matrices  m,  A£t  9  Aj  j  m,  A^l  m. 

(2)  Additive  approximation  (6.54)  of  (6.53)  for  the  same  sets  7+ ,  feature  space,  and  norm  as  in  1), 

(3)  Binary  ±1  embedding  with  the  same  sets  2*  equipped  with  a  matrix  embedding  scheme  operating  on  the  binary 
bound. 

(4)  Ternary  ±1  embedding  with  2*  =  {x*  -  l,x,,xt  +  1}  equipped  with  a  ternary  matrix  embedding  scheme  operating 
on  the  ternary  bound  (the  bounds  appear,  e.g.,  in  [29]). 

Practical  near-optimal  codes  for  the  two  ±1  embedding  methods  can  be  found  in  [31]  and  [98]. 

6.7.2.  Testing  methodology  and  final  results.  Following  the  separation  principle,  the  PI  now  studies  the  security  of  all 
schemes  when  operating  on  the  rate-distortion  bound.  All  tests  were  carried  out  on  the  BOWS2  database  [2]  containing 
approximately  10800  grayscale  images  with  a  fixed  size  of  512  x  512  pixels  coming  from  rescaled  and  cropped  natural 
images  of  various  sizes.  Steganalysis  was  implemented  using  the  second-order  SPAM  feature  set  with  T  =  3  [64].  The 
image  database  wa s  evenly  divided  into  a  training  and  a  testing  set  of  cover  and  stego  images,  respectively.  A  soft- 
margin  support-vector  machine  was  trained  using  the  Gaussian  kernel.  The  kernel  width  and  the  penalty  parameter  were 
determined  using  five-fold  cross  validation  on  the  grid  (C,7)  G  {(10fc,  2*)\k  G  {— 3, . . . , 4},  j  G  {— L  —  3, . . .  T  — L  4*  3}}, 
where  L  =  log2  d  is  the  binary  logarithm  of  the  number  of  features. 

The  results  are  reported  using  the  minimum  average  classification  error  Pe-  Smaller  values  of  Pe  correspond  to  better 
steganalysis  and  thus  larger  statistical  detectability  (lower  security). 

Figure  6.6  displays  the  comparison  of  all  four  embedding  methods  listed  above.  The  methods  based  on  the  the 
bounding  distortion  and  the  additive  approximation  (denoted  as  “Bounding  dist.”  and  “Additive  approx”)  are  completely 
undetectable  for  payloads  smaller  than  0.15  bpp,  which  suggests  that  the  embedding  changes  are  made  in  pixels  not 
covered  by  the  SPAM  features.  Since  both  schemes  are  binary  with  2*  =  {x*,xj}  with  x\  randomly  chosen  from  {x*  — 
l,x<  +  1},  they  become  equivalent  to  simple  binary  ±1  embedding  (Method  3)  as  a  ->  1  and  thus  become  detectable. 
Comparing  the  capacity,  both  schemes  allow  communicating  ten  times  larger  payloads  with  Pe  =  40%  as  compared 
to  ternary  ±1  embedding.  The  advantage  of  using  the  Gibbs  sampler  with  the  bounding  distortion  over  the  additive 
approximation  becomes  more  evident  for  larger  payloads,  where  the  embedding  changes  start  to  interact.  This  confirms 
the  expectation  that  in  this  range  the  additive  approximation  is  unable  to  cope  with  the  interactions  among  changes  and 
thus  its  detectability  increases.  This  result,  however,  may  change  for  different  distortion  measures  and  cover  sources. 
The  fact  that  the  Gibbs  sampler  with  bounding  distortion  did  not  bring  a  substantial  performance  improvement  over  the 
additive  approximation  indicates  that  the  interactions  among  embedding  changes  are  in  general  quite  weak  (at  least  as 
far  as  they  are  captured  by  the  bounding  distortion).  The  low  strength  of  interactions  also  explains  why  only  two  sweeps 
of  the  Gibbs  sampler  were  sufficient  in  practice. 

6.7.3.  Analysis  of  upper  bounds .  As  described  in  Section  6.5.3,  Algorithm  2  for  the  payload-limited  sender  is  unable  to 

embed  the  optimal  payload  of  H( n\)  for  three  reasons.  The  performance  may  be  affected  by  the  small  number  of  sweeps 
of  the  Gibbs  sampler,  the  parameter  A  may  vary  slightly  among  the  sublattices,  and  the  algorithm  embeds  the  erasure 
entropy  The  combined  effect  of  these  factors  is  of  great  importance  for  practitioners  and  is  evaluated 

below  for  trwo  images  using  the  Gibbs  sampler  and  the  thermodynamic  integration  as  explained  in  Section  6.4.2. 

Since  the  Gibbs  construction  depends  on  the  cover  image  x,  the  PI  present  the  results  for  two  grayscale  images  of 
size  512  x  512  pixels  coming  from  two  different  sources.  The  test  image  “O.png”  is  from  the  BOWS2  database  and 
“Lenna”  wras  obtained  from  http://en.wikipedia.Org/wiki/File:Lenna.png  and  converted  to  grayscale  using  GNU 
Image  Manipulation  Program  (GIMP).  In  both  cases,  the  PI  used  the  same  sets  2*  and  the  same  feature  set  as  in  the 
previous  section  with  the  bounding  distortion  with  weight  parameters  <7  =  1  and  0  =  1. 

The  image  “O.png”  contains  more  areas  with  edges  and  textures  than  “Lenna”  and  thus  for  small  distortions,  it  offers 
a  larger  capacity  than  “Lenna”  because  the  weights  (6.75)  around  edg6s  and  complex  texture  are  small.  This  is  apparent 
from  the  slopes  of  the  rate-distortion  bounds  in  Figure  6.7. 
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Figure  6.7.  Comparison  of  the  payload  loss  of  Algorithm  2  for  cover  images  “O.png”  and  “Lenna” 
shown  on  the  right.  The  rate-distortion  bounds  were  obtained  using  the  Gibbs  sampler  (6.38)  and  the 
thermodynamic  integration  (6.40). 


The  same  figure  compares  the  rate-distortion  performance  of  the  payload-limited  sender  simulated  by  the  Gibbs  sampler 
with  only  two  sweeps  as  described  in  Algorithm  2.  For  a  given  payload,  the  distortion  was  obtained  as  an  average  over 
100  random  messages.  The  comparison  shows  that  the  payload  loss  of  Algorithm  2  to  the  optimal  H(n\)  is  quite  small. 
Note  that  the  erasure  entropy,  plotted  in  the  figure  has  been  computed  over  the  sublattices  after  two  sweeps  and 

thus  already  contains  the  impact  of  all  three  factors  discussed  at  the  beginning  of  this  section. 

6.8.  Discussion.  Currently,  the  most  successful  principle  for  designing  practical  steganographic  systems  that  embed  in 
empirical  covers  is  based  on  minimizing  a  suitably  defined  distortion  measure.  Implementation  difficulties  and  a  lack  of 
practical  embedding  methods  have  so  far  limited  the  application  of  this  principle  to  a  rather  special  class  of  distortion 
measures  that  are  additive  over  pixels.  With  the  development  of  near-optimal  low-complexity  coding  schemes,  such  as 
the  syndrome-trellis  codes  [26],  this  direction  has  essentially  reached  its  limits.  It  is  a  firm  belief  of  the  PI  that  further 
substantial  increase  in  secure  payload  is  possible  only  when  the  sender  uses  adaptive  schemes  that  place  embedding  changes 
based  on  the  local  content,  that  dare  to  modify  pixels  in  some  regions  by  more  than  1,  and  that  consider  interactions 
among  embedding  changes  while  preserving  higher-order  statistics  among  pixels.  This  section  describes  a  contribution 
which  is  an  important  step  in  this  direction. 

It  offers  the  steganographer  a  complete  methodology  for  embedding  while  minimizing  an  arbitrarily  defined  distortion 
measure  D.  The  absence  of  any  restrictions  on  D  means  that  the  remaining  task  left  to  the  sender  is  to  find  a  distortion 
measure  that  correlates  with  statistical  detectability.  An  appealing  possibility  is  to  define  D  as  a  weighted  norm  of 
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the  difference  between  cover  and  stego  feature  vectors  used  in  steganalysis.  This  immediately  connects  the  principle  of 
minimum-distortion  steganography  with  the  concept  of  model  preservation  which  has  so  far  been  limited  to  low-dimensional 
models.  Being  able  to  preserve  a  large-dimensional  model  gives  the  steganographer  a  great  advantage  over  the  steganalyst 
because  of  the  difficulties  associated  with  learning  a  high-dimensional  cover  source  model  using  statistical  learning  tools. 

The  proposed  framework  is  called  the  Gibbs  construction  and  it  connects  steganography  with  statistical  physics,  which 
contributed  with  many  practical  algorithms.  In  particular,  the  Gibbs  sampler  combined  with  the  thermodynamic  integra¬ 
tion  can  be  used  to  derive  the  rate-distortion  bound,  simulate  the  impact  of  optimal  embedding,  and  realize  near-optimal 
embedding  algorithms.  These  three  tasks  can  be  addressed  separately  (the  so-called  “separation  principle”)  giving  the 
sender  a  great  amount  of  design  flexibility  as  well  as  control  over  losses  of  practical  schemes. 

An  important  case  elaborated  in  this  section  corresponds  to  D  defined  as  a  sum  of  local  potentials  over  small  pixel 
neighborhoods.  Here,  the  optimal  distribution  of  embedding  modifications  reduces  to  a  Markov  random  field  and  the  Gibbs 
sampler  can  be  turned  into  a  practical  embedding  algorithm  able  to  consider  dependencies  among  embedding  changes. 
When  D  cannot  be  written  as  a  sum  of  local  potentials,  practical  (suboptimal)  methods  can  be  realized  by  approximating 
D  either  with  an  additive  distortion  measure  or  with  local  potentials.  The  problem  of  finding  the  best  approximation  for 
a  given  non-local  D  is  of  its  own  interest.  The  PI  did  not  cover  the  task  of  minimizing  the  statistical  detectability  with 
respect  to  the  distortion  function  completely  due  to  its  inherent  complexity;  it  is  left  as  part  of  future  effort. 

The  proposed  methodology  was  described  both  for  a  payload-limited  sender  and  the  distortion-limited  sender.  The 
former  embeds  a  fixed  payload  in  every  image  with  minimal  distortion,  while  the  latter  embeds  the  maximal  payload 
for  a  given  distortion  in  every  image.  The  distortion-limited  sender  better  corresponds  to  the  intuition  that,  for  a  fixed 
statistical  detectability,  more  textured  or  noisy  images  can  carry  a  larger  secure  payload  than  smoother  or  simpler  images. 
The  fact  that  the  size  of  the  hidden  message  is  driven  by  the  cover  image  essentially  represents  a  more  realistic  case  of 
the  batch  steganography  paradigm  [50]. 

Note  that  the  distortion  measure  is  used  only  by  the  sender  and  thus  does  not  need  to  be  shared.  The  only  information 
needed  by  the  receiver  to  decode  the  message  is  its  size  which  can  be  communicated  separately  in  the  same  cover  image. 
Tins  opens  up  the  intriguing  possibility  to  develop  embedding  schemes  able  to  learn  the  proper  distortion  function  while 
observing  the  impact  of  embedding  on  the  cover  source. 

Finally,  the  proposed  methodology  can  be  applied  to  other  data  hiding  problems  where  the  statistical  detectability 
constraint  could  be  replaced  by  a  perceptual  distortion  constraint.  The  implicit  premise  of  this  work  is  the  direct  rela¬ 
tionship  between  the  distortion  function  D  and  statistical  detectability.  Designing  (and  possibly  learning)  the  distortion 
measure  for  a  given  cover  source  is  an  interesting  problem  by  itself  and  was  addressed  by  the  PI  in  her  last  publication  [24] 
developed  under  this  effort  and  made  possible  by  the  No-Cost  Extension.  C-W-  implementation  with  Matlab  wrappers  of 
STCs  and  multi-layered  STCs  are  available  at  http://dde.binghainton.edu/download/syndrome/. 
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